Swin-MSTP: Swin transformer with multi-scale temporal perception for continuous sign language recognition

Sarah Alyami, Hamzah Luqman*

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Continuous sign language recognition (CSLR) aims to recognize and interpret sequences of sign language gestures in videos. Currently, most CSLR frameworks combine spatial feature extractors based on convolutional neural networks (CNNs) with temporal convolutional networks (TCNs) for sequence learning. However, CNN-based spatial feature extractors apply the same convolutional kernel uniformly across all regions of an image, which limits their capacity to extract complex details, such as fingers and facial features, that are essential for CSLR. In addition, sign languages include signs of varying lengths that cannot be accurately modeled using a fixed-size TCN. To address these issues, we present the Swin multi-scale temporal perception (Swin-MSTP) framework, which uses the Swin Transformer as the spatial feature extractor to capture fine spatial details and provide a stronger contextual understanding between sign language elements in video frames. The Swin Transformer is integrated with the MSTP module, which extracts temporal features. Experimental results show that our single-modality system outperforms existing methods, including multimodal frameworks, on the CSL dataset. The model also achieves competitive performance on the Phoenix2014, Phoenix2014-T, and CSL-Daily datasets. The code is available at https://github.com/snalyami/Swin-MSTP.
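
The abstract does not describe the internals of the MSTP module, so the following is only a minimal sketch of the general idea of multi-scale temporal modeling, not the authors' implementation. It assumes that per-frame feature vectors (standing in for Swin Transformer outputs) are passed through parallel temporal branches with different window sizes and that the branch outputs are concatenated per frame. The function names, the kernel sizes (3, 5, 7), and the uniform averaging kernel used in place of learned convolution weights are all illustrative assumptions.

```python
from typing import List, Sequence

Frame = List[float]  # one per-frame feature vector (hypothetical stand-in for spatial features)


def temporal_branch(frames: Sequence[Frame], kernel_size: int) -> List[Frame]:
    """Average each feature over a centred window of `kernel_size` frames.

    A uniform kernel stands in for learned 1D convolution weights (assumption).
    The window is clipped at the sequence boundaries, so the output has the
    same number of frames as the input.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd for a centred window")
    half = kernel_size // 2
    out: List[Frame] = []
    for t in range(len(frames)):
        window = frames[max(0, t - half): t + half + 1]
        dim = len(frames[t])
        out.append([sum(f[d] for f in window) / len(window) for d in range(dim)])
    return out


def multi_scale_temporal(
    frames: Sequence[Frame], kernel_sizes: Sequence[int] = (3, 5, 7)
) -> List[Frame]:
    """Run one temporal branch per kernel size and concatenate the results per frame."""
    branches = [temporal_branch(frames, k) for k in kernel_sizes]
    fused: List[Frame] = []
    for t in range(len(frames)):
        merged: Frame = []
        for branch in branches:
            merged.extend(branch[t])  # short and long windows sit side by side
        fused.append(merged)
    return fused


if __name__ == "__main__":
    # Six frames with a single feature each; the output keeps six frames,
    # each carrying one value per temporal scale.
    toy_frames = [[float(t)] for t in range(6)]
    for row in multi_scale_temporal(toy_frames):
        print(row)
```

Short windows respond to brief signs while longer windows cover extended ones, which is the motivation the abstract gives for moving beyond a fixed-size TCN. A trained model would learn the kernel weights rather than average uniformly.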

Original language: English
Article number: 129015
Journal: Neurocomputing
Volume: 617
State: Published - 7 Feb 2025

Bibliographical note

Publisher Copyright:
© 2024 Elsevier B.V.

Keywords

  • Continuous sign language recognition
  • Gesture recognition
  • Multiscale TCN
  • Sign language recognition
  • Sign language translation
  • Swin transformer

ASJC Scopus subject areas

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence
