Leveraging Self-Supervised Audio-Visual Pretrained Models to Improve Vocoded Speech Intelligibility in Cochlear Implant Simulation

  • Richard Lee Lai
  • Jen Cheng Hou
  • I. Chun Chern
  • Kuo Hsuan Hung
  • Yi Ting Chen
  • Mandar Gogate
  • Tughrul Arslan
  • Amir Hussain
  • Chii Wann Lin*
  • Yu Tsao

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Objective: Individuals with hearing impairments face challenges in their ability to comprehend speech, particularly in noisy environments. This study explores the effectiveness of audio-visual speech enhancement (AVSE) in improving the intelligibility of vocoded speech in cochlear implant (CI) simulations.

Methods: We propose a speech enhancement framework called Self-Supervised Learning-based AVSE (SSL-AVSE), which uses visual cues such as lip and mouth movements along with corresponding speech. Features are extracted using the AV-HuBERT model and refined through a bidirectional LSTM. Experiments were conducted using the Taiwan Mandarin speech with video (TMSV) dataset.

Results: Objective evaluations showed improvements in PESQ from 1.43 to 1.67 and in STOI from 0.70 to 0.74. NCM scores increased by up to 87.2% over the noisy baseline. Subjective listening tests further demonstrated maximum gains of 45.2% in speech quality and 51.9% in word intelligibility.

Conclusion: SSL-AVSE consistently outperforms audio-only speech enhancement (AOSE) and conventional AVSE baselines. Listening tests with statistically significant results confirm its effectiveness. In addition to its strong performance, SSL-AVSE demonstrates cross-lingual generalization: although it was pretrained on English data, it performs effectively on Mandarin speech. This finding highlights the robustness of the features extracted by a pretrained foundation model and their applicability across languages.

Significance: To the best of our knowledge, no prior work has explored the application of AVSE to CI simulations. This study provides the first evidence that incorporating visual information can significantly improve the intelligibility of vocoded speech in CI scenarios.
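
The Results quote the PESQ and STOI changes as absolute scores, while the NCM and listening-test gains are given as percentages. As a rough illustration only, the sketch below converts the two quoted score pairs into relative gains so they can be read on the same footing. The helper name `relative_gain` is hypothetical, and the abstract does not state which condition the starting values correspond to, so the resulting percentages are simple arithmetic on the quoted numbers and not figures reported by the authors.

```python
def relative_gain(before: float, after: float) -> float:
    """Return the percentage change from `before` to `after`."""
    if before == 0:
        raise ValueError("starting value must be non-zero")
    return (after - before) / before * 100.0


# Score pairs quoted in the abstract (starting value, improved value).
# Which condition the starting values represent is not stated there.
reported = {"PESQ": (1.43, 1.67), "STOI": (0.70, 0.74)}

# Relative gain for each metric, in percent.
gains = {name: relative_gain(b, a) for name, (b, a) in reported.items()}

for name, gain in gains.items():
    print(f"{name}: {gain:.1f}% relative gain")
```

Under these assumptions the quoted pairs work out to roughly 16.8% for PESQ and 5.7% for STOI. Because the two metrics use different scales, these percentages are not directly comparable to each other or to the NCM figure.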

Original language: English
Pages (from-to): 1561-1572
Number of pages: 12
Journal: IEEE Transactions on Biomedical Engineering
Volume: 73
Issue number: 4
DOIs
State: Published - 1 Apr 2026
Externally published: Yes

Bibliographical note

Publisher Copyright:
© 1964-2012 IEEE.

Keywords

  • Audio-visual speech enhancement
  • cochlear implants
  • cross-lingual generalization
  • self-supervised learning

ASJC Scopus subject areas

  • Biomedical Engineering
