Integrating multi-platform gene expression data and machine learning assisted biomarker discovery in colorectal cancer

Research output: Contribution to journal › Article › peer-review

Abstract

Public repositories host a wealth of gene expression datasets, most of which come from microarray platforms. More recent studies increasingly use high-throughput RNA sequencing (RNA-Seq) for its better specificity and sensitivity. This study proposes an approach that combines gene expression data from multiple colorectal cancer (CRC) datasets generated with both high-throughput sequencing and microarray technologies. Integrating the data increases statistical power and strengthens the biological relevance of the findings. We employed least absolute shrinkage and selection operator (LASSO) regression for feature selection on the combined dataset to reduce its dimensionality and retain only robust gene signatures associated with colorectal cancer. The selected features were subjected to functional enrichment analysis and also served as input to multiple classifiers. We applied five machine learning and two deep learning models to identify the most effective genes shared across all seven classification algorithms. Model performance was assessed using F1 score, accuracy, sensitivity, and specificity. The models were evaluated on an external dataset obtained from The Cancer Genome Atlas (TCGA) database. Random forest and one-dimensional convolutional neural networks (1D-CNNs) were the most effective models, achieving the highest accuracies. Each model also achieved greater than 90% accuracy when tested on the TCGA dataset. Finally, we identified carbonic anhydrase 7 (CA7), ATP binding cassette subfamily A member 8 (ABCA8), somatostatin (SST), myomesin 1 (MYOM1), C-C motif chemokine ligand 23 (CCL23), procollagen C-endopeptidase enhancer 2 (PCOLCE2), and C-X-C motif chemokine ligand 10 (CXCL10) as potential prognostic biomarkers of CRC. This study presents a data integration and machine learning approach for finding biomarkers in CRC. The identified gene panel shows promise as a diagnostic tool and needs further validation in clinical settings.
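The four evaluation measures named in the abstract follow from the counts in a binary confusion matrix. The sketch below is illustrative only and is not the authors' code: the function name `binary_metrics`, the label encoding (1 = tumor, 0 = normal), and the toy labels are assumptions introduced here, and the values do not come from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BinaryMetrics:
    accuracy: float
    sensitivity: float
    specificity: float
    f1: float


def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, sensitivity, specificity and F1 for binary labels.

    Hypothetical helper for illustration; `positive` marks the tumor class.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)

    def ratio(num, den):
        # Return 0.0 instead of raising when a class is absent.
        return num / den if den else 0.0

    return BinaryMetrics(
        accuracy=ratio(tp + tn, tp + tn + fp + fn),
        sensitivity=ratio(tp, tp + fn),      # true positive rate (recall)
        specificity=ratio(tn, tn + fp),      # true negative rate
        f1=ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of precision and recall
    )


# Toy labels, made up for illustration (1 = tumor, 0 = normal).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters when tumor and normal samples are imbalanced, since accuracy alone can look high for a classifier that mostly predicts the majority class.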

Original language: English
Article number: 12172025
Journal: Journal of King Saud University - Science
Volume: 38
Issue number: 3
State: Published - Mar 2026

Bibliographical note

Publisher Copyright:
© 2026 Journal of King Saud University – Science.

Keywords

  • Bioinformatics
  • Colorectal cancer
  • Gene expression
  • Machine learning
  • Predictive biomarker

ASJC Scopus subject areas

  • General
