Lightweight Hybrid CNN-Vision Transformer for Real-Time Automated Shipping Container Damage Detection

Research output: Contribution to journal › Article › peer-review

Abstract

Manual container inspections often lead to inconsistencies and inefficiencies, which can disrupt supply chains and increase operational costs. The time-consuming nature of manual checks makes automation an appealing alternative. This paper presents a lightweight hybrid model combining Convolutional Neural Networks (CNN) and Vision Transformers (ViT), specifically designed for automated container damage classification. The CNN extracts fine-grained local features, while the ViT models global structural patterns, overcoming the limitations of purely convolutional architectures. We evaluate four model variants on a dataset of 2,116 images collected from container depots near Jakarta Port. Our proposed CNN-ViT hybrid model generalizes well on this dataset, achieving 96.57% ± 0.83 accuracy, 0.089 ± 0.015 binary cross-entropy loss, and 64.21 ± 1.47 ms inference latency, and peaking at 97.2% accuracy and 62 ms latency in its best trial with only 1 million parameters. Compared to MobileNetV2, our approach improves classification accuracy by about 1% while reducing inference time by approximately 9 ms, demonstrating its efficiency for real-time automated container inspection in resource-constrained environments.
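
The abstract describes the hybrid design only at a high level. As a rough illustration, the PyTorch sketch below feeds a small CNN stem (local feature extraction) into a Transformer encoder (global attention over feature tokens) with a [CLS]-token head producing a single logit for binary damage classification. The class name HybridCNNViT and all layer widths, depths, and head counts are hypothetical assumptions, chosen only to land near the paper's roughly 1-million-parameter scale; they are not the authors' exact architecture.

import torch
import torch.nn as nn

class HybridCNNViT(nn.Module):
    def __init__(self, img_size=224, embed_dim=128, depth=4, heads=4):
        super().__init__()
        # CNN stem: extracts fine-grained local features and downsamples
        # the input by a factor of 16 (224 -> 14).
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, embed_dim, 3, stride=4, padding=1),
        )
        num_patches = (img_size // 16) ** 2  # 14 x 14 = 196 feature tokens
        # Learnable [CLS] token and positional embeddings for the ViT part.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        # Transformer encoder: models global structural patterns across tokens.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, dim_feedforward=embed_dim * 2,
            batch_first=True, norm_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Single logit for binary (damaged / undamaged) classification.
        self.head = nn.Linear(embed_dim, 1)

    def forward(self, x):
        feats = self.stem(x)                       # (B, C, H/16, W/16)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, N, C)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        encoded = self.encoder(tokens)
        return self.head(encoded[:, 0])            # logit from the [CLS] token

model = HybridCNNViT()
print(sum(p.numel() for p in model.parameters()))   # parameter count, ~1M scale
logits = model(torch.randn(2, 3, 224, 224))         # (2, 1) raw logits
loss = nn.functional.binary_cross_entropy_with_logits(logits, torch.ones(2, 1))

In this kind of hybrid, the CNN stem replaces the usual linear patch projection, so the Transformer attends over convolutional feature tokens rather than raw pixel patches; the binary cross-entropy loss at the end matches the loss reported in the abstract.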

Original language: English
Pages (from-to): 537-544
Number of pages: 8
Journal: FME Transactions
Volume: 53
Issue number: 4
DOIs
State: Published - 2025

Bibliographical note

Publisher Copyright:
© Faculty of Mechanical Engineering, Belgrade. All rights reserved.

Keywords

  • CNN
  • ViT
  • binary classification
  • computer vision
  • container inspection
  • lightweight models

ASJC Scopus subject areas

  • Mechanics of Materials
  • Mechanical Engineering
