Abstract
Manual container inspections often lead to inconsistencies and inefficiencies, which can disrupt supply chains and increase operational costs. Because manual checks are time-consuming, automation is an appealing alternative. This paper presents a lightweight hybrid model that combines a Convolutional Neural Network (CNN) with a Vision Transformer (ViT) and is designed specifically for automated container damage classification. The CNN extracts fine-grained local features, while the ViT models global structural patterns, overcoming the limitations of purely convolutional architectures. We evaluate four model variants on a dataset of 2,116 images collected from container depots near Jakarta Port. The proposed CNN-ViT hybrid generalizes well on this dataset, achieving 96.57 ± 0.83% accuracy, 0.089 ± 0.015 binary cross-entropy loss, and 64.21 ± 1.47 ms inference latency with only 1 million parameters; the best trial reaches 97.2% accuracy and 62 ms latency. Compared to MobileNetV2, our approach improves classification accuracy by about 1% while reducing inference time by approximately 9 ms, demonstrating its efficiency for real-time automated container inspection in resource-constrained environments.
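The abstract reports each metric as a mean ± spread over repeated trials and uses binary cross-entropy as the loss. The following is a minimal sketch of how such figures could be computed, not the authors' evaluation code. All numbers in it are hypothetical placeholders rather than data from the paper, the label convention (1 = damaged, 0 = intact) is an assumption, and the use of the sample standard deviation is also an assumption, since the abstract does not say which spread measure was used.

```python
import math
from statistics import mean, stdev


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def summarize(values):
    """Return (mean, sample standard deviation) across repeated trials."""
    return mean(values), stdev(values)


# Hypothetical per-trial accuracies in percent (NOT the paper's results).
trial_accuracy = [96.0, 97.0, 98.0]
acc_mean, acc_std = summarize(trial_accuracy)
print(f"accuracy: {acc_mean:.2f} ± {acc_std:.2f} %")

# Hypothetical labels (assumed: 1 = damaged, 0 = intact) and model outputs.
labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.8, 0.3]
print(f"BCE loss: {binary_cross_entropy(labels, probs):.3f}")
```

The same `summarize` helper would apply to per-trial loss and latency values; swapping `stdev` for `statistics.pstdev` gives the population standard deviation instead.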
| Original language | English |
|---|---|
| Pages (from-to) | 537-544 |
| Number of pages | 8 |
| Journal | FME Transactions |
| Volume | 53 |
| Issue number | 4 |
| State | Published - 2025 |
Bibliographical note
Publisher Copyright: © Faculty of Mechanical Engineering, Belgrade. All rights reserved.
Keywords
- CNN
- ViT
- binary classification
- computer vision
- container inspection
- lightweight models
ASJC Scopus subject areas
- Mechanics of Materials
- Mechanical Engineering