Resampling Techniques for Materials Informatics: Limitations in Crystal Point Groups Classification

Abdulmohsen A. Alsaui, Yousef A. Alghofaili, Mohammed Alghadeer, Fahhad H. Alharbi*

*Corresponding author for this work

Research output: Contribution to journal › Review article › peer-review

9 Scopus citations

Abstract

Imbalanced data sets are pervasive in materials informatics and pose a challenge to the development of classification models. This work investigates crystal point group prediction as an example of an imbalanced classification problem in materials informatics. Multiple resampling and classification techniques were considered. The findings suggest that, as expected, the most influential parameter of the resampling algorithms is the one controlling the number of samples to omit (undersampling) or to generate synthetically (oversampling). Balancing enhances classification performance on the minority class at the cost of fewer correct predictions for the majority class. Moreover, ideal balancing, where the classes are exactly balanced, is not optimal; partial balancing should be performed instead. In this study, the best ratio of the minority to the majority class was found to be around two-thirds. The largest improvement in classification was obtained with random undersampling combined with the k-nearest neighbors and random forest classifiers.
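
To make the idea of partial balancing concrete, the following is a minimal sketch of random undersampling toward a target minority-to-majority ratio. It is not the authors' implementation: the function name `random_undersample`, the restriction to two classes, the NumPy-only approach, and the toy data are all assumptions made for illustration. The default `ratio=2/3` simply mirrors the value reported in the abstract.

```python
import numpy as np

def random_undersample(X, y, ratio=2/3, seed=0):
    """Randomly drop majority-class samples until
    n_minority / n_majority is approximately `ratio`.

    Hypothetical helper for illustration; binary labels only.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must lie in (0, 1]")
    X = np.asarray(X)
    y = np.asarray(y)

    classes, counts = np.unique(y, return_counts=True)
    if len(classes) != 2:
        raise ValueError("this sketch handles the binary case only")

    # Sort classes by frequency: first is the minority, second the majority.
    order = np.argsort(counts)
    minority, majority = classes[order[0]], classes[order[1]]
    n_min, n_maj = counts[order[0]], counts[order[1]]

    # Number of majority samples to keep; never more than are available.
    n_keep = min(n_maj, int(round(n_min / ratio)))

    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority)
    kept_maj = rng.choice(maj_idx, size=n_keep, replace=False)

    # Keep every minority sample plus the sampled majority subset.
    keep = np.sort(np.concatenate([np.flatnonzero(y == minority), kept_maj]))
    return X[keep], y[keep]

# Toy data (an assumption): 900 majority samples, 100 minority samples.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 900 + [1] * 100)

X_res, y_res = random_undersample(X, y, ratio=2/3, seed=0)
print(np.bincount(y_res))
```

Setting `ratio=1.0` in this sketch corresponds to ideal balancing, while smaller values leave the majority class partially intact. Resampling of this kind would normally be applied to the training split only, so that evaluation reflects the original class distribution.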

Original language: English
Pages (from-to): 3514-3523
Number of pages: 10
Journal: Journal of Chemical Information and Modeling
Volume: 62
Issue number: 15
State: Published - 8 Aug 2022

Bibliographical note

Publisher Copyright:
© 2022 American Chemical Society. All rights reserved.

ASJC Scopus subject areas

  • General Chemistry
  • General Chemical Engineering
  • Computer Science Applications
  • Library and Information Sciences
