SignVLM: a pre-trained large video model for sign language recognition

Research output: Contribution to journal › Article › peer-review

Abstract

Sign language recognition (SLR) plays a vital role in including people with hearing impairments in the community. It facilitates the recognition of sign gestures and converts them into spoken language. One of the main challenges in developing SLR systems is the lack of annotated datasets, an issue that is most noticeable for low-resourced sign languages. To address this issue, we propose a pre-trained large vision model, SignVLM, for SLR. This work explores the capability of the contrastive language–image pre-training (CLIP) model for SLR: CLIP is used to extract spatial features from the sign video frames, while a Transformer decoder is used for temporal learning. The proposed model has been evaluated on four different sign languages using the KArSL, WLASL, LSA64, and AUTSL datasets. Several evaluation settings have been followed in this work, including zero-shot and few-shot learning. The proposed model outperformed other models on the KArSL, WLASL, and LSA64 datasets and achieved comparable performance on the AUTSL dataset. The results demonstrate that the proposed model generalizes to new datasets with few samples. The code and data are available at https://github.com/Hamzah-Luqman/signVLM.
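The abstract describes a two-stage pipeline: a CLIP image encoder produces one spatial feature vector per video frame, and a Transformer decoder then models the temporal dimension across frames. The following stdlib-only sketch illustrates that data flow; `encode_frame`, `attention_pool`, and `classify_sign` are hypothetical stand-ins written for illustration, not the paper's actual components (which would use a real CLIP backbone and a full Transformer decoder).

```python
import math

def encode_frame(frame, dim=4):
    # Stand-in for the frozen CLIP image encoder: maps one frame
    # (here just a flat list of pixel values) to a feature vector.
    return [sum(frame) * (i + 1) / dim for i in range(dim)]

def attention_pool(features):
    # Toy softmax-weighted pooling over time, playing the role the
    # Transformer decoder plays in the described architecture.
    scores = [sum(f) for f in features]        # one scalar score per frame
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]

def classify_sign(video, num_classes=3):
    # video: list of frames. Stage 1 extracts spatial features per frame,
    # stage 2 aggregates them over time, then a stub head scores each class.
    feats = [encode_frame(frame) for frame in video]
    pooled = attention_pool(feats)
    return [sum(pooled) * c for c in range(1, num_classes + 1)]

video = [[0.1, 0.2], [0.3, 0.1], [0.2, 0.4]]
logits = classify_sign(video)
print(len(logits))  # one score per sign class
```

The key design point this sketch mirrors is the separation of concerns: the spatial encoder runs independently on every frame, so a pre-trained image model can be reused unchanged, and only the temporal module needs to learn sign-specific dynamics.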

Original language: English
Article number: e3112
Journal: PeerJ Computer Science
Volume: 11
State: Published - 2025

Bibliographical note

Publisher Copyright:
© 2025 Luqman. Distributed under Creative Commons CC-BY 4.0. https://creativecommons.org/licenses/by/4.0/

Keywords

  • Algorithms and Analysis of Algorithms
  • Arabic sign language
  • Artificial Intelligence
  • CLIP
  • Computer Vision
  • Large vision models
  • Natural Language and Speech
  • Neural Networks
  • Sign language recognition
  • Sign language translation

ASJC Scopus subject areas

  • General Computer Science

