Abstract
The performance of machine learning (ML)-based malware detection systems depends strongly on the availability of balanced, well-labeled datasets. However, benign Portable Executable (PE) files are often scarce, which leads to class imbalance and degrades classifier performance. This paper presents a novel PE File Synthesizer (PEFS) framework that uses Generative Adversarial Networks (GANs) to generate synthetic benign data for augmenting malware detection datasets. The proposed method extracts structured PE header features from real benign files and trains a Wasserstein GAN to produce 9,000 synthetic feature vectors. These vectors are then reconstructed into structurally valid but non-executable PE files, intended solely for static analysis and classifier training. The generated samples are checked with sandbox analysis (Cuckoo) and VirusTotal to confirm that they exhibit benign characteristics. When added to the training data, the synthesized instances improve classifier performance, particularly on benign samples, as measured by the precision, recall, and F1-score of a Random Forest classifier. This work highlights the viability of GAN-based data augmentation for PE datasets while preserving safety and privacy through non-executable synthetic file generation.
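As a rough illustration of the header feature extraction step, the sketch below reads the fixed-layout COFF file header from the bytes of a PE image and flattens it into a numeric vector. The abstract does not say which header fields PEFS uses, so the choice of the COFF file header, and the names `COFF_FIELDS`, `extract_coff_features`, and `to_feature_vector`, are assumptions made here for illustration and are not the authors' implementation. The field layout follows the public PE/COFF format.

```python
import struct

# Field names of the 20-byte COFF file header, in on-disk order.
# Which fields the paper's feature set actually contains is not stated;
# this selection is a hypothetical example.
COFF_FIELDS = (
    "machine",
    "number_of_sections",
    "time_date_stamp",
    "pointer_to_symbol_table",
    "number_of_symbols",
    "size_of_optional_header",
    "characteristics",
)


def extract_coff_features(data: bytes) -> dict:
    """Return the COFF file-header fields of a PE image as a dict of ints."""
    # The DOS header is 64 bytes long and starts with the magic "MZ".
    if len(data) < 0x40 or data[:2] != b"MZ":
        raise ValueError("missing or truncated DOS header")

    # e_lfanew, a little-endian 32-bit value at offset 0x3C,
    # holds the file offset of the PE signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")

    # The COFF file header follows the 4-byte signature directly.
    if len(data) < pe_offset + 4 + 20:
        raise ValueError("truncated COFF file header")
    values = struct.unpack_from("<HHIIIHH", data, pe_offset + 4)
    return dict(zip(COFF_FIELDS, values))


def to_feature_vector(features: dict) -> list:
    """Flatten the header dict into a fixed-order list of floats."""
    return [float(features[name]) for name in COFF_FIELDS]
```

A file's contents can be passed in as `extract_coff_features(open(path, "rb").read())`, where `path` is a placeholder. A practical feature set would likely also draw on the optional header and section table, and the raw values would normally be scaled before being used to train a GAN.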
| Original language | English |
|---|---|
| Article number | 36 |
| Journal | Journal of Computer Virology and Hacking Techniques |
| Volume | 22 |
| Issue number | 1 |
| DOIs | |
| State | Published - Dec 2026 |
Bibliographical note
Publisher Copyright: © The Author(s), under exclusive licence to Springer-Verlag France SAS, part of Springer Nature 2026.
Keywords
- Generative Adversarial Networks (GANs)
- Machine learning in cybersecurity
- Malware detection
- Portable Executable (PE) Files
- Synthetic data generation
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- Software
- Hardware and Architecture
- Computational Theory and Mathematics
Title
GAN-based PE file synthesizer for balanced malware detection datasets: safe generation of synthetic benign executables