Abstract
Non-intrusive assessment of speech quality and intelligibility is essential when clean reference signals are unavailable. In this work, we propose a multimodal framework that integrates audio features and visual cues to predict PESQ and STOI scores. The framework employs a dual-branch architecture: spectral features are extracted using the STFT, and visual embeddings are obtained via a visual encoder. These features are then fused and processed by a CNN-BLSTM with attention, followed by multi-task learning to predict PESQ and STOI simultaneously. Evaluations on the LRS3-TED dataset, augmented with noise from the DEMAND corpus, show that our model outperforms the audio-only baseline. Under seen noise conditions, it improves the linear correlation coefficient (LCC) by a relative 9.61% (0.8397→0.9205) for PESQ and 11.47% (0.7403→0.8253) for STOI. These results highlight the effectiveness of incorporating visual cues in enhancing the accuracy of non-intrusive speech assessment.
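The reported gains can be read as relative changes in LCC between the audio-only baseline and the audio-visual model. The sketch below is illustrative only: it assumes LCC denotes the Pearson linear correlation between predicted and ground-truth scores, and the function names `lcc` and `relative_gain_percent` are hypothetical, not taken from the paper. Recomputing from the rounded values quoted in the abstract gives figures within about 0.01 percentage points of those reported; the small difference is presumably due to rounding of the published correlations.

```python
import math


def lcc(predicted, target):
    """Pearson linear correlation between two equal-length score sequences.

    Assumption: this is the usual definition of LCC; the paper's exact
    evaluation code is not available here.
    """
    n = len(predicted)
    if n != len(target) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0 or var_t == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)


def relative_gain_percent(baseline, improved):
    """Relative improvement of `improved` over `baseline`, in percent."""
    return 100.0 * (improved - baseline) / baseline


# Rounded LCC values quoted in the abstract (seen noise conditions).
pesq_gain = relative_gain_percent(0.8397, 0.9205)
stoi_gain = relative_gain_percent(0.7403, 0.8253)

print(f"PESQ LCC relative gain: {pesq_gain:.2f}%")
print(f"STOI LCC relative gain: {stoi_gain:.2f}%")
```

In absolute terms the same figures correspond to LCC increases of 0.0808 for PESQ and 0.0850 for STOI.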
| Original language | English |
|---|---|
| Pages (from-to) | 5418-5422 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| State | Published - 2025 |
| Externally published | Yes |
| Event | 26th Interspeech Conference 2025 - Rotterdam, Netherlands. Duration: 17 Aug 2025 → 21 Aug 2025 |
Bibliographical note
Publisher Copyright: © 2025 International Speech Communication Association. All rights reserved.
Keywords
- multimodal learning
- non-intrusive speech assessment
- speech assessment
- speech quality estimation
ASJC Scopus subject areas
- Software
- Signal Processing
- Language and Linguistics
- Modeling and Simulation
- Human-Computer Interaction
Fingerprint
Dive into the research topics of 'A Study on Speech Assessment with Visual Cues'. Together they form a unique fingerprint.