Abstract
Satellite imagery offers rich information for land cover classification, but choosing an effective yet efficient feature extractor or backbone architecture remains challenging. In this study, I benchmark 25 vision transformers across 10 public land cover datasets to guide backbone selection for downstream classification tasks. The proposed approach encodes each satellite image into a fixed-length feature vector via a pre-trained transformer, then trains and tests a linear support-vector classifier on these encodings to isolate the impact of the backbone alone. I report average classification accuracy and F1-score over three random stratified splits per dataset, and I also measure training time to assess computational cost. Results show that image encodings produced by large-receptive-field transformers with advanced self-attention, particularly deit3_base_patch16_224 and twins_svt_large, achieve the highest accuracies without incurring prohibitive training times. In contrast, encodings from the compact variants train faster but incur notable performance drops of around 7%–8%. These findings reveal a clear trade-off between representational power and efficiency. Practitioners can use these rankings to select a transformer backbone that best balances accuracy and computational efficiency for satellite image-based land cover classification, accelerating the development of robust and resource-aware systems.
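The protocol described in the abstract (frozen backbone encoding, then a linear support-vector classifier scored over three stratified splits) can be sketched roughly as follows. This is a minimal illustration, not the study's actual code: the use of `timm` and scikit-learn, the function names `encode_images` and `evaluate_backbone`, the 80/20 split ratio, macro-averaged F1, and timing only the classifier fit are all assumptions, since the abstract does not specify them.

```python
import time

import numpy as np
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import LinearSVC


def encode_images(pil_images, model_name="deit3_base_patch16_224"):
    """Encode PIL images into fixed-length vectors with a frozen backbone.

    Assumes torch and timm are installed; imported lazily so the
    evaluation code below runs without them.
    """
    import timm
    import torch
    from timm.data import create_transform, resolve_data_config

    # num_classes=0 drops the classification head, so the model
    # returns pooled features instead of class logits.
    model = timm.create_model(model_name, pretrained=True, num_classes=0).eval()
    transform = create_transform(**resolve_data_config({}, model=model))
    with torch.no_grad():
        batch = torch.stack([transform(img.convert("RGB")) for img in pil_images])
        return model(batch).cpu().numpy()


def evaluate_backbone(features, labels, n_splits=3, test_size=0.2, seed=0):
    """Mean accuracy, macro F1, and SVC fit time over stratified splits."""
    splitter = StratifiedShuffleSplit(
        n_splits=n_splits, test_size=test_size, random_state=seed
    )
    accs, f1s, fit_times = [], [], []
    for train_idx, test_idx in splitter.split(features, labels):
        clf = LinearSVC()
        start = time.perf_counter()
        clf.fit(features[train_idx], labels[train_idx])
        fit_times.append(time.perf_counter() - start)

        pred = clf.predict(features[test_idx])
        accs.append(accuracy_score(labels[test_idx], pred))
        # Macro averaging is an assumption; the abstract only says "F1-score".
        f1s.append(f1_score(labels[test_idx], pred, average="macro"))
    return {
        "accuracy": float(np.mean(accs)),
        "f1": float(np.mean(f1s)),
        "train_seconds": float(np.mean(fit_times)),
    }


if __name__ == "__main__":
    # Synthetic stand-in for backbone encodings: 3 classes, 16-dim features.
    rng = np.random.default_rng(0)
    labels = np.repeat(np.arange(3), 40)
    features = rng.normal(size=(120, 16))
    features[np.arange(120), labels] += 6.0  # shift one coordinate per class
    print(evaluate_backbone(features, labels))
```

Because the classifier is linear and the backbone stays frozen, differences in the returned scores between two sets of encodings can be attributed to the backbone rather than to the classifier, which is the isolation the abstract describes.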
| Original language | English |
|---|---|
| Article number | e70082 |
| Journal | Expert Systems |
| Volume | 42 |
| Issue number | 7 |
| State | Published - Jul 2025 |
Bibliographical note
Publisher Copyright: © 2025 John Wiley & Sons Ltd.
Keywords
- deep learning
- image-based land cover classification
- support vector machine
- vision transformers
ASJC Scopus subject areas
- Control and Systems Engineering
- Theoretical Computer Science
- Computational Theory and Mathematics
- Artificial Intelligence