Abstract
Emotion recognition is one of the most complex research areas because individuals express emotional cues through several modalities, such as audio, facial expressions, and language. Recognizing emotion from a single modality is not always feasible, as each modality is disturbed by several factors. Existing models fall short of accurately identifying the expressions of individuals. In this paper, a novel hybrid multi-modal emotion recognition framework, InceptionV3DenseNet, is proposed to improve recognition accuracy. First, contextual features are extracted from the video, audio, and text modalities. From the video modality, features such as shot length, lighting key, motion, and color are extracted. Zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), energy, and pitch are extracted from the audio modality, and unigram, bigram, and TF-IDF features are extracted from the textual modality. Feature extraction yields high-level features with better generalization capability. The extracted features are fused using multi-set integrated canonical correlation analysis (MICCA) and provided as input to the proposed hybrid network model. The fusion captures the correlation between multimodal features, providing better performance with a single learning phase. The proposed hybrid deep learning model is then used to classify emotional states, with attention to accuracy and reliability. Simulations are conducted in MATLAB and evaluated on the MELD and RAVDESS datasets. The results show that the proposed model is more efficient and accurate than the compared models, attaining an overall accuracy of 74.87% on MELD and 95.25% on RAVDESS.
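To make a few of the low-level features named in the abstract concrete, the following is a minimal Python sketch of zero-crossing rate and short-time energy for one audio frame, and of unigram/bigram TF-IDF weights for text. It is not the authors' implementation (the paper reports MATLAB simulations). The function names, the whitespace tokenizer, and the specific TF-IDF variant (relative term frequency times log(N/df)) are assumptions chosen for illustration, and MFCC, pitch, the video features, and the MICCA fusion are not covered.

```python
import math
from collections import Counter


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    if not frame:
        return 0.0
    return sum(x * x for x in frame) / len(frame)


def ngrams(tokens, n):
    """Contiguous n-grams of a token list, returned as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def tf_idf(docs, n=1):
    """Per-document TF-IDF weights over n-grams.

    Assumed variant: tf = count / number of n-grams in the document,
    idf = log(N / document frequency). Other variants add smoothing.
    """
    # Hypothetical tokenizer: lowercase and split on whitespace.
    term_lists = [ngrams(doc.lower().split(), n) for doc in docs]
    # Document frequency: in how many documents each n-gram appears.
    df = Counter(term for terms in term_lists for term in set(terms))
    n_docs = len(docs)
    weights = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for short docs
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights
```

With this variant, an n-gram that occurs in every utterance receives a weight of zero, so only terms that distinguish utterances contribute to the textual feature vector; `tf_idf(utterances, n=2)` gives the bigram counterpart.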
| Original language | English |
|---|---|
| Pages (from-to) | 40375-40402 |
| Number of pages | 28 |
| Journal | Multimedia Tools and Applications |
| Volume | 82 |
| Issue number | 26 |
| DOIs | |
| State | Published - Nov 2023 |
| Externally published | Yes |
Bibliographical note
Publisher Copyright: © 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
Keywords
- Audio features
- Classification
- Deep learning
- Feature extraction
- Feature fusion
- Multi-modal emotion recognition
- Textual features
- Video features
ASJC Scopus subject areas
- Software
- Media Technology
- Hardware and Architecture
- Computer Networks and Communications