Speech Emotion Recognition: An Empirical Analysis of Machine Learning Algorithms Across Diverse Data Sets

  • Mostafiz Ahammed*
  • Rubel Sheikh
  • Farah Hossain
  • Shahrima Mustak Liza
  • Muhammad Arifur Rahman
  • Mufti Mahmud
  • David J. Brown

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citations

Abstract

Communication is the means by which people express their feelings, ideas, and thoughts, and speech is a primary medium for it. In many human-interactive applications, such as call centres, entertainment, e-learning between teachers and students, medicine, and communication between clinicians and patients (especially important in psychiatry), it is crucial to identify people's emotions in order to better understand what they are feeling and how they might react in a range of situations. Automated systems that recognise emotions from speech or the human voice using Artificial Intelligence (AI) or Machine Learning (ML) approaches are gaining momentum in recent research. This research aims to recognise a range of emotional states, such as happy, sad, calm, angry, fear, disgust, surprise, and neutral, from input speech signals with greater accuracy than currently seen in contemporary research. To achieve this aim, we use the Support Vector Machine (SVM) classification algorithm and form a feature vector from speech features including the Mel Frequency Cepstral Coefficients (MFCC), Chroma, Mel-spectrogram, Spectral Centroid, Spectral Bandwidth, Spectral Roll-off, Root Mean Squared Energy (RMSE), and Zero Crossing Rate (ZCR). The system is tested on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS), the Toronto Emotional Speech Set (TESS), and the Surrey Audio-Visual Expressed Emotion Database (SAVEE) datasets. Our proposed approach achieves an overall accuracy of 99.59% on the RAVDESS dataset, 99.82% on the TESS dataset, and 98.95% on the SAVEE dataset for the SVM classifier. A mixed dataset created from the three speech emotion datasets also achieves significantly higher classification accuracy than state-of-the-art methods.
This model performs well on a large dataset, is ready to be tested on even larger datasets, and can be used in a range of diverse applications, including education and clinical settings. GitHub: https://github.com/Mostafiz24/Speech-Emotion-Recognition.
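As a minimal illustration of two of the simpler features named in the abstract, the sketch below computes the Zero Crossing Rate (ZCR) and Root Mean Squared Energy (RMSE) of a synthetic signal with numpy alone. This is an assumption-laden simplification, not the authors' pipeline: the synthetic sine wave, the function names, and the numpy-only implementation are choices made for this example, and the paper's remaining features (MFCC, chroma, mel-spectrogram, spectral statistics) and the SVM classifier itself are not reproduced here.

```python
import numpy as np

# A synthetic 1-second, 440 Hz sine at 16 kHz stands in for a speech clip
# (illustrative only; the paper uses recorded speech from RAVDESS/TESS/SAVEE).
sr = 16000
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)

def zero_crossing_rate(signal):
    """Fraction of adjacent sample pairs whose signs differ."""
    signs = np.sign(signal)
    return np.mean(np.abs(np.diff(signs)) > 0)

def rms_energy(signal):
    """Root mean squared energy over the whole signal."""
    return np.sqrt(np.mean(signal ** 2))

zcr = zero_crossing_rate(y)
rms = rms_energy(y)

# A unit-amplitude sine crosses zero twice per period (~880 crossings here),
# and its RMS is 1/sqrt(2) ~ 0.7071.
print(f"ZCR:  {zcr:.4f}")
print(f"RMSE: {rms:.4f}")

# In a full pipeline, such scalars (alongside MFCC, chroma, and spectral
# features) would be concatenated into one feature vector per utterance
# and fed to an SVM classifier.
feature_vector = np.array([zcr, rms])
```

In practice these features are usually extracted frame-by-frame with an audio library rather than over the whole clip, and the per-frame values are aggregated (e.g. averaged) before classification.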

Original language: English
Title of host publication: Applied Intelligence and Informatics - 3rd International Conference, AII 2023, Revised Selected Papers
Editors: Mufti Mahmud, Hanene Ben-Abdallah, M. Shamim Kaiser, Muhammad Raisuddin Ahmed, Ning Zhong
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 32-46
Number of pages: 15
ISBN (Print): 9783031686382
DOIs
State: Published - 2024
Externally published: Yes
Event: 3rd International Conference on Applied Intelligence and Informatics, AII 2023 - Dubai, United Arab Emirates
Duration: 29 Oct 2023 – 31 Oct 2023

Publication series

Name: Communications in Computer and Information Science
Volume: 2065 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 3rd International Conference on Applied Intelligence and Informatics, AII 2023
Country/Territory: United Arab Emirates
City: Dubai
Period: 29/10/23 – 31/10/23

Bibliographical note

Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.

Keywords

  • AI
  • Emotion Recognition
  • Feature Extraction
  • MFCC
  • ML
  • RAVDESS
  • RMSE
  • SAVEE
  • SVM
  • TESS
  • ZCR

ASJC Scopus subject areas

  • General Computer Science
  • General Mathematics
