Abstract
Facial expression recognition (FER) in 3D and 4D domains presents a significant challenge in affective computing due to the complexity of spatial and temporal facial dynamics. Its success is crucial for advancing applications in human behavior understanding, healthcare monitoring, and human-computer interaction. In this work, we propose FACET–VLM, a vision–language framework for 3D/4D FER that integrates multiview facial representation learning with semantic guidance from natural language prompts. FACET–VLM introduces three key components: Cross-View Semantic Aggregation (CVSA) for view-consistent fusion, Multiview Text-Guided Fusion (MTGF) for semantically aligned facial emotion representation, and a multiview consistency loss to enforce structural coherence across views. Our model achieves state-of-the-art accuracy across multiple benchmarks, including BU-3DFE, Bosphorus, BU-4DFE, and BP4D-Spontaneous. We further extend FACET–VLM to 4D micro-expression recognition (MER) on the 4DME dataset, demonstrating strong performance in capturing subtle, short-lived emotional cues. FACET–VLM achieves up to 99.41% accuracy on BU-4DFE and outperforms prior methods by margins as high as 15.12% in cross-dataset evaluation on BP4D. Extensive experimental results confirm the effectiveness and substantial contribution of each individual component within the framework. Overall, FACET–VLM offers a robust, extensible, and high-performing solution for multimodal FER in both posed and spontaneous settings.
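The abstract does not give the exact formulation of the multiview consistency loss, but losses of this kind are commonly implemented as a pairwise agreement penalty between per-view embeddings of the same face. The sketch below is a generic, hedged illustration of that idea in NumPy; the function name, normalization choice, and pairwise squared-distance form are assumptions, not FACET–VLM's actual loss.

```python
import numpy as np

def multiview_consistency_loss(view_embeddings):
    """Illustrative multiview consistency penalty (not the paper's
    exact formulation): mean pairwise squared Euclidean distance
    between L2-normalized embeddings of the same face seen from
    different views. Identical views incur zero penalty."""
    normed = [v / np.linalg.norm(v) for v in view_embeddings]
    total, pairs = 0.0, 0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            total += float(np.sum((normed[i] - normed[j]) ** 2))
            pairs += 1
    return total / pairs

# Three identical view embeddings: perfectly consistent, zero loss.
e = np.ones(8)
print(multiview_consistency_loss([e, e, e]))  # 0.0
```

Minimizing such a term pushes the encoder to produce view-invariant features, which is one standard way to "enforce structural coherence across views" as the abstract describes.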
| Original language | English |
|---|---|
| Article number | 131621 |
| Journal | Neurocomputing |
| Volume | 657 |
| DOIs | |
| State | Published - 7 Dec 2025 |
Bibliographical note
Publisher Copyright: © 2025 Elsevier B.V.
Keywords
- Artificial intelligence
- Computer vision
- Emotion recognition
- Facial expression recognition
- Point-clouds
- Vision-language models (VLMs)
ASJC Scopus subject areas
- Computer Science Applications
- Cognitive Neuroscience
- Artificial Intelligence
Fingerprint
Dive into the research topics of 'FACET–VLM: Facial emotion learning with text-guided multiview fusion via vision-language model for 3D/4D facial expression recognition'.