FACET–VLM: Facial emotion learning with text-guided multiview fusion via vision-language model for 3D/4D facial expression recognition

Research output: Contribution to journal › Article › peer-review

Abstract

Facial expression recognition (FER) in 3D and 4D domains presents a significant challenge in affective computing due to the complexity of spatial and temporal facial dynamics. Its success is crucial for advancing applications in human behavior understanding, healthcare monitoring, and human-computer interaction. In this work, we propose FACET–VLM, a vision–language framework for 3D/4D FER that integrates multiview facial representation learning with semantic guidance from natural language prompts. FACET–VLM introduces three key components: Cross-View Semantic Aggregation (CVSA) for view-consistent fusion, Multiview Text-Guided Fusion (MTGF) for semantically aligned facial emotions, and a multiview consistency loss to enforce structural coherence across views. Our model achieves state-of-the-art accuracy across multiple benchmarks, including BU-3DFE, Bosphorus, BU-4DFE, and BP4D-Spontaneous. We further extend FACET–VLM to 4D micro-expression recognition (MER) on the 4DME dataset, demonstrating strong performance in capturing subtle, short-lived emotional cues. FACET–VLM achieves up to 99.41 % accuracy on BU-4DFE and outperforms prior methods by margins as high as 15.12 % in cross-dataset evaluation on BP4D. The extensive experimental results confirm the effectiveness and substantial contributions of each individual component within the framework. Overall, FACET–VLM offers a robust, extensible, and high-performing solution for multimodal FER in both posed and spontaneous settings.

Original language: English
Article number: 131621
Journal: Neurocomputing
Volume: 657
State: Published - 7 Dec 2025

Bibliographical note

Publisher Copyright:
© 2025 Elsevier B.V.

Keywords

  • Artificial intelligence
  • Computer vision
  • Emotion recognition
  • Facial expression recognition
  • Point-clouds
  • Vision-language models (VLMs)

ASJC Scopus subject areas

  • Computer Science Applications
  • Cognitive Neuroscience
  • Artificial Intelligence
