Abstract
Pronunciation variation is a major obstacle to improving the performance of Arabic automatic continuous speech recognition systems. This phenomenon alters the pronunciation spelling of words beyond their listed forms in the pronunciation dictionary, leading to a number of out-of-vocabulary word forms. This paper presents a direct data-driven approach to modeling within-word pronunciation variations, in which the pronunciation variants are distilled from the training speech corpus. The proposed method consists of performing phoneme recognition, followed by a sequence alignment between the observed phonemes generated by the phoneme recognizer and the reference phonemes obtained from the pronunciation dictionary. The unique collected variants are then added to the dictionary as well as to the language model. We started with a baseline Arabic speech recognition system based on the Sphinx3 engine. The baseline system is based on a 5.4-hour speech corpus of Modern Standard Arabic broadcast news, with a pronunciation dictionary of 14,234 canonical pronunciations. The baseline system achieves a word error rate of 13.39%. Our results show that while the expanded dictionary alone did not add appreciable improvements, the word error rate is significantly reduced by 2.22% when the variants are represented within the language model.
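The alignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a plain unit-cost edit-distance alignment, assigns inserted phonemes to the most recently seen reference word, and uses made-up words and phoneme symbols. The function names `align` and `collect_variants` are hypothetical, and the later step of representing variants in the language model is not covered.

```python
from collections import defaultdict


def align(ref, obs):
    """Align two phoneme lists with unit-cost edit distance.

    Returns (pairs, distance). Each pair is (ref_index, obs_index);
    None marks a deleted reference phoneme or an inserted observed one.
    """
    n, m = len(ref), len(obs)
    # cost[i][j] is the edit distance between ref[:i] and obs[:j].
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = cost[i - 1][j - 1] + (ref[i - 1] != obs[j - 1])
            cost[i][j] = min(sub, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover one best alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0
                and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != obs[j - 1])):
            i, j = i - 1, j - 1
            pairs.append((i, j))      # match or substitution
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            i -= 1
            pairs.append((i, None))   # reference phoneme deleted
        else:
            j -= 1
            pairs.append((None, j))   # observed phoneme inserted
    pairs.reverse()
    return pairs, cost[n][m]


def collect_variants(utterance, observed):
    """Return {word: set of non-canonical pronunciations} for one utterance.

    utterance: list of (word, canonical_phonemes) in spoken order.
    observed:  phoneme list produced by a phoneme recognizer.
    """
    ref, owner = [], []
    for w, (_, phones) in enumerate(utterance):
        ref.extend(phones)
        owner.extend([w] * len(phones))
    pairs, _ = align(ref, observed)
    heard = defaultdict(list)
    current = 0  # assumption: insertions belong to the last reference word seen
    for i, j in pairs:
        if i is not None:
            current = owner[i]
        if j is not None:
            heard[current].append(observed[j])
    variants = defaultdict(set)
    for w, (word, phones) in enumerate(utterance):
        candidate = tuple(heard[w])
        if candidate and candidate != tuple(phones):
            variants[word].add(candidate)
    return dict(variants)


# Hypothetical two-word utterance in which the final /n/ of the first word
# is recognized as /m/ before the following /b/.
utterance = [("min", ["m", "i", "n"]), ("baEd", ["b", "a", "E", "d"])]
observed = ["m", "i", "m", "b", "a", "E", "d"]
print(collect_variants(utterance, observed))  # {'min': {('m', 'i', 'm')}}
```

Collecting variants in a set per word mirrors the abstract's point that only unique variants are added to the dictionary. A real system would likely also filter variants by frequency or alignment cost, which this sketch omits.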
Original language | English |
---|---|
Pages (from-to) | 65-75 |
Number of pages | 11 |
Journal | International Journal of Speech Technology |
Volume | 15 |
Issue number | 2 |
DOIs | |
State | Published - Jun 2012 |
Bibliographical note
Funding Information: This work is supported by Saudi Arabia Government research grant NSTP # (08-INF100-4). The authors would also like to thank King Fahd University of Petroleum and Minerals for its support of this research work.
Keywords
- Data-driven approach
- Language model
- Modern standard Arabic
- Pronunciation variation
- Speech recognition
ASJC Scopus subject areas
- Software
- Language and Linguistics
- Human-Computer Interaction
- Linguistics and Language
- Computer Vision and Pattern Recognition