Data leakage detection in machine learning code: transfer learning, active learning, or low-shot prompting?

  • Nouf Alturayeif*
  • Jameleddine Hassine

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

With the increasing reliance on machine learning (ML) across diverse disciplines, ML code has been subject to a number of issues that impact its quality, such as lack of documentation, algorithmic biases, overfitting, lack of reproducibility, inadequate data preprocessing, and potential for data leakage, all of which can significantly affect the performance and reliability of ML models. Data leakage degrades the quality of ML models when sensitive information from the test set inadvertently influences the training process, leading to inflated performance metrics that do not generalize well to new, unseen data. Data leakage can occur at either the dataset level (i.e., during dataset construction) or the code level. Existing studies introduced methods to detect code-level data leakage using manual and code analysis approaches. However, automated tools with advanced ML techniques are increasingly recognized as essential for efficiently identifying quality issues in large and complex codebases, enhancing the overall effectiveness of code review processes. In this article, we explore ML-based approaches suited to limited annotated datasets for detecting code-level data leakage in ML code. We propose three approaches, namely, transfer learning, active learning, and low-shot prompting. Additionally, we introduce an automated approach to handle the imbalance issues of code data. Our results show that active learning outperformed the other approaches with an F-2 score of 0.72 and reduced the number of needed annotated samples from 1,523 to 698. We conclude that existing ML-based approaches can effectively mitigate the challenges associated with limited data availability.
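
As an illustration only (not the authors' detector or data), the sketch below shows one common form of code-level data leakage, where preprocessing statistics are computed on the training and test data together, next to a leak-free variant. It also shows how an F-2 score is derived from confusion counts. The toy values, the helper names `standardize` and `fbeta`, and the detector counts are all assumptions made for this example.

```python
from statistics import mean, pstdev


def standardize(values, mu, sigma):
    """Scale values with the given mean and standard deviation."""
    return [(v - mu) / sigma for v in values]


def fbeta(tp, fp, fn, beta=2.0):
    """F-beta from confusion counts; beta > 1 weights recall over precision."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


# Hypothetical toy feature column, already split into train and test parts.
train = [1.0, 2.0, 3.0, 4.0]
test = [10.0, 12.0]

# Leaky: statistics are computed on train + test, so test values
# influence how the training data is transformed.
leaky_mu, leaky_sigma = mean(train + test), pstdev(train + test)
leaky_train = standardize(train, leaky_mu, leaky_sigma)

# Leak-free: statistics come from the training split only and are
# then reused unchanged on the test split.
clean_mu, clean_sigma = mean(train), pstdev(train)
clean_train = standardize(train, clean_mu, clean_sigma)
clean_test = standardize(test, clean_mu, clean_sigma)

print("mean used in leaky pipeline:", leaky_mu)
print("mean used in clean pipeline:", clean_mu)

# Hypothetical detector counts: 8 leaks found, 4 false alarms, 2 missed.
print("F2:", round(fbeta(tp=8, fp=4, fn=2), 3))
```

With beta set to 2, a missed leak costs more than a false alarm, which is presumably why a recall-weighted metric suits a detection task like this one.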

Original language: English
Article number: e2730
Journal: PeerJ Computer Science
Volume: 11
State: Published - 2025

Bibliographical note

Publisher Copyright:
© 2025 Alturayeif and Hassine

Keywords

  • Active learning
  • Code quality
  • Data leakage
  • Low-shot prompting
  • Transfer learning

ASJC Scopus subject areas

  • General Computer Science
