SmellyCode++: Multi-Label Dataset for Code Smell Detection

Nawaf Alomari*, Amal Alazba, Hamoud Aljamaan, Mohammad Alshayeb

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Context: Code smells indicate poor software design, affecting maintainability. Accurate detection is vital for refactoring and quality improvement. However, existing datasets often frame detection as single-label classification, limiting realism. Objective: This paper develops a multi-label dataset for code smell detection, integrating textual features and numerical metrics from open-source Java projects. Method: We collected code from 103 Java projects, parsed it into Abstract Syntax Trees (ASTs), extracted features, and annotated samples based on prior studies. Data cleaning, unification, and merging techniques were applied to support four code smells: God Class, Data Class, Feature Envy, and Long Method. Results: The dataset comprises 107,554 samples with multi-label annotations, improving detection realism. Evaluation shows F1 scores of 95.89% (Data Class), 94.48% (God Class), 88.68% (Feature Envy), and 88.87% (Long Method). Conclusion: This dataset aids advanced studies on code smell detection, particularly for fine-tuning LLMs. Future work can expand it to other languages and additional smells, enhancing diversity and applicability.
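To make the single-label versus multi-label distinction in the abstract concrete, here is a minimal sketch in which each sample carries a 0/1 vector over the four smells and F1 is computed separately per smell. The helper names (`to_multi_hot`, `per_label_f1`) and the toy annotations are illustrative assumptions; they are not taken from the dataset or from the authors' evaluation code.

```python
# Illustrative sketch only: the smell names come from the abstract, everything else is assumed.
SMELLS = ["God Class", "Data Class", "Feature Envy", "Long Method"]


def to_multi_hot(labels):
    """Encode a set of smell names as a 0/1 vector in SMELLS order."""
    return [1 if smell in labels else 0 for smell in SMELLS]


def per_label_f1(y_true, y_pred):
    """Compute F1 independently for each smell (one column per smell)."""
    scores = {}
    for j, smell in enumerate(SMELLS):
        pairs = [(t[j], p[j]) for t, p in zip(y_true, y_pred)]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the smell never occurs
        scores[smell] = 2 * tp / denom if denom else 0.0
    return scores


# Made-up annotations: one sample may carry several smells, or none.
y_true = [
    to_multi_hot({"God Class", "Long Method"}),
    to_multi_hot({"Data Class"}),
    to_multi_hot(set()),
    to_multi_hot({"Feature Envy", "Long Method"}),
]
y_pred = [
    to_multi_hot({"God Class"}),
    to_multi_hot({"Data Class"}),
    to_multi_hot({"Long Method"}),
    to_multi_hot({"Feature Envy", "Long Method"}),
]

scores = per_label_f1(y_true, y_pred)
for smell, f1 in scores.items():
    print(f"{smell}: {f1:.2f}")
```

A single-label framing would force the first sample to be either God Class or Long Method; the multi-hot vector keeps both, which is why the scores are reported per smell.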

Original language: English
Article number: 1207
Journal: Scientific Data
Volume: 12
Issue number: 1
State: Published - Dec 2025

Bibliographical note

Publisher Copyright:
© The Author(s) 2025.

ASJC Scopus subject areas

  • Statistics and Probability
  • Information Systems
  • Education
  • Computer Science Applications
  • Statistics, Probability and Uncertainty
  • Library and Information Sciences
