SMILE-VLM: Self-Supervised Multi-View Representation Learning Using Vision-Language Model for 3D/4D Facial Expression Recognition

Research output: Contribution to journal › Article › peer-review

Abstract

Facial expression recognition (FER) is a fundamental task in affective computing with applications in human-computer interaction, mental health analysis, and behavioral understanding. In this paper, we propose SMILE-VLM, a self-supervised vision-language model for 3D/4D FER that unifies multiview visual representation learning with natural language supervision. SMILE-VLM learns robust, semantically aligned, and view-invariant embeddings through three core components: multiview decorrelation via a Barlow Twins-style loss, vision-language contrastive alignment, and cross-modal redundancy minimization. Our framework achieves state-of-the-art performance on multiple benchmarks. We further extend SMILE-VLM to the task of 4D micro-expression recognition (MER) to capture subtle affective cues. Extensive experiments demonstrate that SMILE-VLM not only surpasses existing unsupervised methods but also matches or exceeds supervised baselines, offering a scalable and annotation-efficient solution for expressive facial behavior understanding.
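The abstract names two loss families that are well known in the literature: a Barlow Twins-style decorrelation loss across views and a CLIP-style vision-language contrastive alignment loss. As a purely illustrative sketch (assuming a PyTorch setup; the function names, dimensions, and hyperparameters below are hypothetical and not taken from the paper), these two terms might look as follows:

```python
# Minimal, hypothetical sketch of two loss terms named in the abstract.
# Not the authors' implementation; shapes and hyperparameters are assumptions.
import torch
import torch.nn.functional as F


def barlow_twins_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                      lambda_offdiag: float = 5e-3) -> torch.Tensor:
    """Multiview decorrelation: drive the cross-correlation matrix of two
    view embeddings toward the identity (Barlow Twins-style)."""
    n, _ = z_a.shape
    # Standardize each embedding dimension over the batch.
    z_a = (z_a - z_a.mean(0)) / (z_a.std(0) + 1e-6)
    z_b = (z_b - z_b.mean(0)) / (z_b.std(0) + 1e-6)
    c = (z_a.T @ z_b) / n                                  # d x d cross-correlation
    on_diag = (torch.diagonal(c) - 1.0).pow(2).sum()       # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # redundancy term
    return on_diag + lambda_offdiag * off_diag


def contrastive_alignment_loss(img_emb: torch.Tensor, txt_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Vision-language contrastive alignment: matched image/text pairs are
    pulled together and mismatched pairs pushed apart (symmetric InfoNCE)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.T / temperature             # batch x batch similarities
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))


if __name__ == "__main__":
    # Toy example with random tensors standing in for multiview facial
    # embeddings and expression-prompt text embeddings.
    view_a, view_b = torch.randn(32, 256), torch.randn(32, 256)
    img, txt = torch.randn(32, 512), torch.randn(32, 512)
    total = barlow_twins_loss(view_a, view_b) + contrastive_alignment_loss(img, txt)
    print(total.item())
```

In practice such terms would be weighted and summed with the cross-modal redundancy-minimization objective mentioned in the abstract; the specific weighting and architecture are described in the full paper.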

Original language: English
Pages (from-to): 143831-143842
Number of pages: 12
Journal: IEEE Access
Volume: 13
DOIs:
State: Published - 2025

Bibliographical note

Publisher Copyright:
© 2025 The Authors.

Keywords

  • 3D/4D point-clouds
  • Artificial intelligence
  • computer vision
  • emotion recognition
  • facial expression recognition
  • vision-language models (VLMs)

ASJC Scopus subject areas

  • General Computer Science
  • General Materials Science
  • General Engineering
