Abstract
Context: Code smells indicate poor software design, affecting maintainability. Accurate detection is vital for refactoring and quality improvement. However, existing datasets often frame detection as single-label classification, limiting realism. Objective: This paper develops a multi-label dataset for code smell detection, integrating textual features and numerical metrics from open-source Java projects. Method: We collected code from 103 Java projects, parsed it into Abstract Syntax Trees (ASTs), extracted features, and annotated samples based on prior studies. Data cleaning, unification, and merging techniques were applied to support four code smells: God Class, Data Class, Feature Envy, and Long Method. Results: The dataset comprises 107,554 samples with multi-label annotations, improving detection realism. Evaluation shows F1 scores of 95.89% (Data Class), 94.48% (God Class), 88.68% (Feature Envy), and 88.87% (Long Method). Conclusion: This dataset aids advanced studies on code smell detection, particularly for fine-tuning LLMs. Future work can expand it to other languages and additional smells, enhancing diversity and applicability.
| Original language | English |
|---|---|
| Article number | 1207 |
| Journal | Scientific data |
| Volume | 12 |
| Issue number | 1 |
| DOIs | |
| State | Published - Dec 2025 |
Bibliographical note
Publisher Copyright:© The Author(s) 2025.
ASJC Scopus subject areas
- Statistics and Probability
- Information Systems
- Education
- Computer Science Applications
- Statistics, Probability and Uncertainty
- Library and Information Sciences