Recognition of splice-junction genetic sequences using random forest and Bayesian optimization

Abdel Karim Baareh, Alaa Elsayad*, Mujahed Al-Dhaifallah

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

6 Scopus citations

Abstract

Bayesian Optimization (BO) has recently emerged as an efficient technique for selecting the hyperparameters of machine learning models. The BO strategy maintains a surrogate model and an acquisition function to optimize computationally expensive functions in only a few iterations. In this paper, we demonstrate the utility of BO for fine-tuning the hyperparameters of a Random Forest (RF) model applied to the recognition of splice-junction genetic sequences. Locating these splice junctions furthers understanding of the DNA splicing process. Specifically, the BO algorithm optimizes four RF hyperparameters: number of trees, number of splitting features, splitting criterion, and leaf size. The optimized RF model automatically selects the most predictive features of the training data. The dataset is obtained from the UCI Machine Learning Repository; half of the records represent two different types of splice junctions, and the other half do not represent any splice junction. Experimental results demonstrate the advantage of BO-RF, with training and test classification accuracies of 99.96% and 97.34%, respectively. The results also demonstrate the ability of the RF model to select the most important features, which ensures the best possible results when using Support Vector Machine (SVM), K-Nearest Neighbor (KNN), and decision tree (DT) models. Practical procedures in model development and evaluation, such as out-of-bag error and cross-validation, are also discussed.
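
The surrogate-plus-acquisition loop described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. It assumes a single hyperparameter scaled to [0, 1], a Gaussian-process surrogate with a squared-exponential kernel, and expected improvement as the acquisition function. `toy_error` is a hypothetical stand-in for an expensive RF validation error, and every function name here is illustrative. The paper itself searches over four RF hyperparameters.

```python
import math

def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalars (length scale is an assumption)."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    return L

def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y

def solve_upper_t(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_fit(xs, ys, noise=1e-6):
    """Fit the surrogate: factor the kernel matrix and precompute K^-1 (y - mean)."""
    n = len(xs)
    K = [[rbf(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    L = cholesky(K)
    mean_y = sum(ys) / n
    alpha = solve_upper_t(L, solve_lower(L, [y - mean_y for y in ys]))
    return L, alpha, mean_y

def gp_predict(model, xs, x_new):
    """Posterior mean and standard deviation of the surrogate at x_new."""
    L, alpha, mean_y = model
    k_star = [rbf(x, x_new) for x in xs]
    mu = mean_y + sum(k * a for k, a in zip(k_star, alpha))
    v = solve_lower(L, k_star)
    var = max(rbf(x_new, x_new) - sum(t * t for t in v), 1e-12)
    return mu, math.sqrt(var)

def expected_improvement(mu, sigma, best):
    """Expected improvement over the current best value (minimization)."""
    z = (best - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2 * math.pi)
    cdf = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return max((best - mu) * cdf + sigma * pdf, 0.0)

def toy_error(x):
    """Hypothetical cheap stand-in for an expensive validation error."""
    return (x - 0.3) ** 2 + 0.05 * math.sin(12 * x)

def bayes_opt(objective, n_iter=10, grid=200):
    """Evaluate the objective where the acquisition function is largest."""
    xs = [0.0, 0.5, 1.0]                      # fixed initial design
    ys = [objective(x) for x in xs]
    candidates = [i / grid for i in range(grid + 1)]
    for _ in range(n_iter):
        model = gp_fit(xs, ys)
        best = min(ys)
        pool = [c for c in candidates if c not in xs]  # skip evaluated points

        def score(c):
            mu, sigma = gp_predict(model, xs, c)
            return expected_improvement(mu, sigma, best)

        x_next = max(pool, key=score)
        xs.append(x_next)
        ys.append(objective(x_next))
    i_best = ys.index(min(ys))
    return xs[i_best], ys[i_best], xs, ys
```

Calling `bayes_opt(toy_error)` returns the best point found, its objective value, and the full evaluation history. In a real setting the objective would train an RF with the proposed hyperparameters and return, for example, its cross-validation or out-of-bag error, so each evaluation is expensive and a small iteration budget matters.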

Original language: English
Pages (from-to): 30505-30522
Number of pages: 18
Journal: Multimedia Tools and Applications
Volume: 80
Issue number: 20
State: Published - Aug 2021

Bibliographical note

Publisher Copyright:
© 2021, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Keywords

  • Bayesian optimization
  • Feature selection
  • Support vector machine
  • Decision tree
  • K-nearest neighbor
  • Random forest
  • Splice junction recognition

ASJC Scopus subject areas

  • Software
  • Media Technology
  • Hardware and Architecture
  • Computer Networks and Communications
