A Study on Speech Assessment with Visual Cues

  • Shafique Ahmed
  • Ryandhimas E. Zezario
  • Nasir Saleem
  • Amir Hussain
  • Hsin Min Wang
  • Yu Tsao

Research output: Contribution to journal › Conference article › peer-review

1 Scopus citation

Abstract

Non-intrusive assessment of speech quality and intelligibility is essential when clean reference signals are unavailable. In this work, we propose a multimodal framework that integrates audio features and visual cues to predict PESQ and STOI scores. The framework employs a dual-branch architecture in which spectral features are extracted using the STFT and visual embeddings are obtained via a visual encoder. These features are then fused and processed by a CNN-BLSTM with attention, and multi-task learning is used to predict PESQ and STOI simultaneously. Evaluations on the LRS3-TED dataset, augmented with noise from the DEMAND corpus, show that our model outperforms the audio-only baseline. Under seen noise conditions, it improves the linear correlation coefficient (LCC) by a relative 9.61% (0.8397→0.9205) for PESQ and 11.47% (0.7403→0.8253) for STOI. These results highlight the effectiveness of incorporating visual cues in enhancing the accuracy of non-intrusive speech assessment.
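
The LCC figures in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and is not the authors' evaluation code: the function names `pearson_lcc` and `relative_improvement` are hypothetical, and it assumes that LCC is the Pearson correlation between predicted and reference scores and that the quoted percentages are relative gains over the audio-only baseline.

```python
import math


def pearson_lcc(predicted, target):
    """Pearson linear correlation between two equal-length score lists.

    Hypothetical helper: assumes this is how LCC is defined in the paper.
    """
    n = len(predicted)
    if n != len(target) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


def relative_improvement(baseline, improved):
    """Relative change from baseline to improved, in percent."""
    return (improved - baseline) / baseline * 100.0


# LCC values quoted in the abstract (audio-only baseline, then audio-visual).
pesq_gain = relative_improvement(0.8397, 0.9205)
stoi_gain = relative_improvement(0.7403, 0.8253)
print(f"PESQ LCC gain: {pesq_gain:.2f}%")  # about 9.62
print(f"STOI LCC gain: {stoi_gain:.2f}%")  # about 11.48
```

Recomputing from the rounded LCC values gives roughly 9.62% and 11.48%, which is consistent with the reported 9.61% and 11.47% if those were derived from unrounded correlations. The absolute gains are 0.0808 and 0.0850 respectively.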

Original language: English
Pages (from-to): 5418-5422
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
DOIs
State: Published - 2025
Externally published: Yes
Event: 26th Interspeech Conference 2025 - Rotterdam, Netherlands
Duration: 17 Aug 2025 → 21 Aug 2025

Bibliographical note

Publisher Copyright:
© 2025 International Speech Communication Association. All rights reserved.

Keywords

  • multimodal learning
  • non-intrusive speech assessment
  • speech assessment
  • speech quality estimation

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Language and Linguistics
  • Modeling and Simulation
  • Human-Computer Interaction
