Abstract
Continuous sign language recognition (CSLR) aims to recognize and interpret sequences of sign language gestures in videos. Most current CSLR frameworks combine spatial feature extractors based on convolutional neural networks (CNNs) with temporal convolutional networks (TCNs) for sequence learning. However, CNN-based spatial feature extractors apply the same convolutional kernel uniformly across all regions of an image, which limits their capacity to extract fine-grained details, such as fingers and facial features, that are essential for CSLR. In addition, sign languages contain signs of varying lengths that cannot be accurately modeled by a fixed-size TCN. To address these issues, we present the Swin multiscale temporal perception (Swin-MSTP) framework, which uses the Swin Transformer as the spatial feature extractor to capture fine spatial details and provide a stronger contextual understanding of sign language elements in video frames. The Swin Transformer is integrated with the MSTP module, which extracts temporal features at multiple scales. Experimental results show that our single-modality system outperforms existing methods, including multimodal frameworks, on the CSL dataset, and achieves competitive performance on the Phoenix2014, Phoenix2014-T, and CSL-Daily datasets. The code is available at https://github.com/snalyami/Swin-MSTP.
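The pipeline described above can be pictured as per-frame Swin features followed by multiscale temporal convolutions and a sequence head. The sketch below is a minimal PyTorch illustration of that idea only; the branch kernel sizes, hidden width, the BiLSTM/CTC-style head, and the use of torchvision's `swin_t` backbone are assumptions for illustration and not the authors' exact implementation (see the linked repository for that).

```python
# Illustrative sketch of a Swin + multiscale temporal perception (MSTP) pipeline.
# Assumptions (not from the paper): kernel sizes (3, 5, 7), a BiLSTM before the
# classifier, and torchvision's Swin-Tiny as the spatial backbone.
import torch
import torch.nn as nn
from torchvision.models import swin_t


class MSTP(nn.Module):
    """Parallel 1D convolutions with different kernel sizes so signs of
    varying durations are covered (multiscale temporal perception)."""

    def __init__(self, dim=768, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(dim, dim, k, padding=k // 2),
                nn.BatchNorm1d(dim),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        )

    def forward(self, x):  # x: (batch, dim, time)
        return sum(branch(x) for branch in self.branches)


class SwinMSTP(nn.Module):
    def __init__(self, num_classes, dim=768):
        super().__init__()
        self.backbone = swin_t(weights=None)    # per-frame spatial features
        self.backbone.head = nn.Identity()      # drop the ImageNet classifier
        self.mstp = MSTP(dim)
        self.temporal = nn.LSTM(dim, dim // 2, bidirectional=True, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)  # frame-wise gloss logits

    def forward(self, frames):                  # frames: (batch, time, 3, H, W)
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1)).view(b, t, -1)   # (b, t, dim)
        feats = self.mstp(feats.transpose(1, 2)).transpose(1, 2)     # multiscale temporal
        feats, _ = self.temporal(feats)
        return self.classifier(feats)           # typically decoded with CTC over glosses


# Example (placeholder gloss vocabulary size):
# logits = SwinMSTP(num_classes=100)(torch.randn(2, 16, 3, 224, 224))
```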
| Original language | English |
|---|---|
| Article number | 129015 |
| Journal | Neurocomputing |
| Volume | 617 |
| DOIs | |
| State | Published - 7 Feb 2025 |
Bibliographical note
Publisher Copyright: © 2024 Elsevier B.V.
Keywords
- Continuous sign language recognition
- Gesture recognition
- Multiscale TCN
- Sign language recognition
- Sign language translation
- Swin transformer
ASJC Scopus subject areas
- Computer Science Applications
- Cognitive Neuroscience
- Artificial Intelligence