Hybrid multi-modal emotion recognition framework based on InceptionV3DenseNet

  • Fakir Mashuque Alamgir*
  • Md Shafiul Alam

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

13 Scopus citations

Abstract

Emotion recognition is one of the most complex research areas, as individuals express emotional cues through several modalities such as audio, facial expressions, and language. Recognizing emotion from a single modality is not always feasible, since individual modalities are disturbed by several factors, and existing models cannot attain maximum accuracy in exactly identifying individuals' expressions. In this paper, a novel hybrid multi-modal emotion recognition framework, InceptionV3DenseNet, is proposed to improve recognition accuracy. Initially, contextual features are extracted from three modalities: video, audio, and text. From the video modality, features such as shot length, lighting key, motion, and color are extracted. Zero-crossing rate, Mel-frequency cepstral coefficients (MFCC), energy, and pitch are extracted from the audio modality, and unigram, bigram, and TF-IDF features are extracted from the textual modality. During feature extraction, high-level features with better generalization capability are obtained. The extracted features are fused using multi-set integrated canonical correlation analysis (MICCA) and provided as input to the proposed hybrid network model. MICCA captures the correlations among the multimodal features, providing better performance within a single learning phase. The proposed hybrid deep learning model then classifies the emotional states with an emphasis on accuracy and reliability. Simulations are conducted on the MATLAB platform and evaluated using the MELD and RAVDESS datasets. The results show that the proposed model is more efficient and accurate than the compared models, attaining an overall accuracy of 74.87% on MELD and 95.25% on RAVDESS.
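To make the pipeline described in the abstract concrete, the following is a minimal Python sketch of the audio and text feature extraction and the correlation-based fusion step. It is not the authors' code: the paper's experiments were run in MATLAB, its MICCA fusion handles more than two feature sets (scikit-learn's two-set CCA is used here only to illustrate the idea), and all function names, parameter values, and pooling choices below are assumptions.

```python
# Hedged sketch of the feature extraction + fusion stages from the abstract.
# Libraries: librosa (audio), scikit-learn (text TF-IDF, CCA), numpy.
import numpy as np
import librosa
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cross_decomposition import CCA


def audio_features(path):
    """Audio cues named in the abstract: zero-crossing rate, MFCC, energy, pitch."""
    y, sr = librosa.load(path, sr=None)
    zcr = librosa.feature.zero_crossing_rate(y)          # zero-crossing rate per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # Mel-frequency cepstral coeffs
    energy = librosa.feature.rms(y=y)                    # RMS energy as the energy cue
    pitch = librosa.yin(y, fmin=65, fmax=2093, sr=sr)    # fundamental frequency (YIN)
    # Pool each descriptor over time so every clip yields one fixed-length vector.
    return np.concatenate([zcr.mean(axis=1), mfcc.mean(axis=1),
                           energy.mean(axis=1), [pitch.mean()]])


def text_features(utterances):
    """Unigram + bigram TF-IDF vectors for the textual modality."""
    vec = TfidfVectorizer(ngram_range=(1, 2))            # (1, 2) covers uni- and bigrams
    return vec.fit_transform(utterances).toarray()


def cca_fuse(X_audio, X_text, n_components=8):
    """Correlation-based fusion. The paper uses multi-set integrated CCA (MICCA);
    two-set CCA here only illustrates projecting modalities into a shared,
    maximally correlated space before classification."""
    cca = CCA(n_components=n_components)
    A, T = cca.fit(X_audio, X_text).transform(X_audio, X_text)
    return np.hstack([A, T])   # fused representation fed to the hybrid classifier
```

The same pattern would extend to the video features (shot length, lighting key, motion, color) as a third set, which is exactly the case a multi-set method such as MICCA is intended to cover.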
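The abstract names the classifier "InceptionV3DenseNet", suggesting a backbone that combines InceptionV3 and DenseNet feature extractors. The paper's exact architecture is not reproduced here, so the Keras sketch below (backbone variant, pooling, head size, and the seven-class output) is an assumption about what such a hybrid could look like, not the authors' model.

```python
# Hedged sketch: concatenating pooled InceptionV3 and DenseNet121 features
# behind a shared input, then a softmax emotion head. All choices are assumed.
import tensorflow as tf


def hybrid_inception_densenet(input_shape=(224, 224, 3), n_classes=7):
    inp = tf.keras.Input(shape=input_shape)
    # Both backbones read the same frames; global average pooling gives
    # one fixed-length embedding per backbone.
    incep = tf.keras.applications.InceptionV3(include_top=False, pooling="avg")(inp)
    dense = tf.keras.applications.DenseNet121(include_top=False, pooling="avg")(inp)
    merged = tf.keras.layers.Concatenate()([incep, dense])
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(merged)
    return tf.keras.Model(inp, out)
```

In such a design, the Inception branch contributes multi-scale filters while the DenseNet branch contributes densely connected feature reuse, which is one plausible reading of why the two were hybridized.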

Original language: English
Pages (from-to): 40375-40402
Number of pages: 28
Journal: Multimedia Tools and Applications
Volume: 82
Issue number: 26
State: Published - Nov 2023
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 2023, The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.

Keywords

  • Audio features
  • Classification
  • Deep learning
  • Feature extraction
  • Feature fusion
  • Multi-modal emotion recognition
  • Textual features
  • Video features

ASJC Scopus subject areas

  • Software
  • Media Technology
  • Hardware and Architecture
  • Computer Networks and Communications

