CoRT: Transformer-based code representations with self-supervision by predicting reserved words for code smell detection

Amal Alazba*, Hamoud Aljamaan, Mohammad Alshayeb

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

4 Scopus citations

Abstract

Context: Code smell detection is the process of identifying poorly designed and implemented pieces of code. Machine learning-based approaches require enormous amounts of manually labeled data, which are costly and difficult to scale. Unsupervised semantic feature learning, or learning without manual annotation, is vital for effectively harvesting the enormous amount of available data. Objective: The objective of this study is to propose a new code smell detection approach that utilizes self-supervised learning to learn intermediate representations without the need for labels and then fine-tunes these representations on multiple tasks. Method: We propose Code Representation with Transformers (CoRT), which learns the semantic and structural features of source code by training transformers to recognize masked reserved words in the input code. We empirically demonstrated that the defined proxy task provides a powerful method for learning semantic and structural features. We exhaustively evaluated our approach on four downstream tasks: detection of the Data Class, God Class, Feature Envy, and Long Method code smells. Moreover, we compared our results with those of two paradigms: supervised learning and a feature-based approach. Finally, we conducted a cross-project experiment to evaluate the generalizability of our method to unseen labeled data. Results: The results indicate that the proposed method has a high detection performance for code smells. For instance, on Data Class, CoRT achieved an F1 score between 88.08–99.4, an Area Under the Curve (AUC) between 89.62–99.88, and a Matthews Correlation Coefficient (MCC) between 75.28–98.8, while on God Class it achieved an F1 of 86.32–99.03, an AUC of 92.1–99.85, and an MCC of 76.15–98.09. Compared with the baseline model and the feature-based approach, CoRT achieved better detection performance and showed a high capability to detect code smells in unseen datasets. Conclusions: The proposed method has been shown to be effective in detecting class-level and method-level code smells.
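To make the pretext task concrete, the sketch below illustrates the general idea of masking reserved words in source code and training a transformer encoder to recover them. It is a minimal, illustrative example only: the whitespace tokenizer, the toy vocabulary, the `RESERVED_WORDS` set, and the small PyTorch encoder (with positional encodings omitted for brevity) are assumptions for demonstration and do not reproduce the authors' actual tokenizer, vocabulary, or model configuration.

```python
# Minimal sketch of a masked reserved-word pretext task (illustrative only).
# Assumptions: whitespace tokenisation, a toy vocabulary, and a small PyTorch
# TransformerEncoder without positional encodings; the paper's implementation
# may differ in all of these respects.
import torch
import torch.nn as nn

RESERVED_WORDS = {"public", "private", "static", "void", "int", "return",
                  "if", "else", "for", "while", "class", "new"}

code = "public static int add ( int a , int b ) { return a + b ; }"
tokens = code.split()

# Build a toy vocabulary with special tokens.
vocab = {"[PAD]": 0, "[MASK]": 1}
for tok in tokens:
    vocab.setdefault(tok, len(vocab))

def mask_reserved(tokens):
    """Replace reserved-word tokens with [MASK]; the model must recover them."""
    input_ids, labels = [], []
    for tok in tokens:
        if tok in RESERVED_WORDS:
            input_ids.append(vocab["[MASK]"])
            labels.append(vocab[tok])          # target: the original reserved word
        else:
            input_ids.append(vocab[tok])
            labels.append(-100)                # ignored by the loss
    return torch.tensor([input_ids]), torch.tensor([labels])

class ReservedWordPredictor(nn.Module):
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, vocab_size)   # token-level classifier

    def forward(self, input_ids):
        return self.head(self.encoder(self.embed(input_ids)))

model = ReservedWordPredictor(len(vocab))
input_ids, labels = mask_reserved(tokens)
logits = model(input_ids)
# Cross-entropy is computed only at masked positions (label -100 is ignored).
loss = nn.functional.cross_entropy(logits.view(-1, len(vocab)),
                                   labels.view(-1), ignore_index=-100)
loss.backward()
print(f"pretext-task loss: {loss.item():.4f}")
```

After pretraining on this kind of proxy objective, the encoder's representations would be fine-tuned with labeled examples for the downstream code smell detection tasks described in the abstract.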

Original language: English
Article number: 59
Journal: Empirical Software Engineering
Volume: 29
Issue number: 3
DOIs
State: Published - May 2024

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2024.

Keywords

  • Bad smell detection
  • Deep learning
  • Self-supervised learning

ASJC Scopus subject areas

  • Software
