Abstract
Personalised speech enhancement (PSE) and audio-visual (AV) speech enhancement (SE) have emerged as promising approaches to improving speech quality and intelligibility in challenging acoustic environments. PSE leverages speaker-specific vocal characteristics to address the label permutation problem, while AV SE incorporates visual cues, particularly lip movements, to complement auditory signals when speech is degraded by competing noise sources. This paper presents a novel framework that unifies the two approaches, advancing towards personalised AV SE. By integrating raw enrolment audio for adaptive target speaker representation with AV inputs, the proposed system aims to achieve robust SE in real-world environments. Experimental results demonstrate significant improvements in speech intelligibility and noise suppression on the COG-MHEAR Audio-Visual Speech Enhancement Challenge dataset, outperforming state-of-the-art PSE and AV SE models.
| Original language | English |
|---|---|
| Pages (from-to) | 4853-4857 |
| Number of pages | 5 |
| Journal | Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH |
| State | Published - 2025 |
| Externally published | Yes |
| Event | 26th Interspeech Conference 2025, Rotterdam, Netherlands (17 Aug 2025 → 21 Aug 2025) |
Bibliographical note
Publisher Copyright: © 2025 International Speech Communication Association. All rights reserved.
Keywords
- multimodal processing
- speech enhancement
- speech separation
ASJC Scopus subject areas
- Software
- Signal Processing
- Language and Linguistics
- Modeling and Simulation
- Human-Computer Interaction
Fingerprint
Dive into the research topics of 'Towards Personalised Audio Visual Speech Enhancement'. Together they form a unique fingerprint.