TY - GEN
T1 - Phonetically rich and balanced Arabic speech corpus
T2 - International Conference on Computer and Communication Engineering, ICCCE'10
AU - Abushariah, Mohammad A.M.
AU - Ainon, Raja N.
AU - Zainuddin, Roziati
AU - Khalifa, Othman O.
AU - Elshafei, Moustafa
PY - 2010
Y1 - 2010
N2 - A lack of spoken and written training data is one of the main issues encountered by Arabic automatic speech recognition (ASR) researchers. Almost all written and spoken corpora are not readily available to the public, and many of them can only be obtained by purchasing from the Linguistic Data Consortium (LDC) or the European Language Resources Association (ELRA). Spoken training data is in shorter supply than written training data, resulting in a great need for more speech corpora to serve different domains of Arabic ASR. The available spoken corpora were mainly collected from broadcast news (radio and television) and telephone conversations, which have certain technical and quality shortcomings. To produce a robust speaker-independent continuous automatic Arabic speech recognizer, a set of speech recordings that is rich and balanced is required. It is rich in the sense that it must contain all the phonemes of the Arabic language, and balanced in the sense that it must preserve the phonetic distribution of the Arabic language. This set of speech recordings must be based on a proper written set of sentences and phrases created by experts. Therefore, it is crucial to create a high-quality written (text) set of sentences and phrases before recording them. This work adds a new kind of speech data for Arabic language based text and speech applications, alongside other kinds such as broadcast news and telephone conversations. It is therefore an invitation to all Arabic ASR developers and research groups to explore and capitalize on it.
AB - A lack of spoken and written training data is one of the main issues encountered by Arabic automatic speech recognition (ASR) researchers. Almost all written and spoken corpora are not readily available to the public, and many of them can only be obtained by purchasing from the Linguistic Data Consortium (LDC) or the European Language Resources Association (ELRA). Spoken training data is in shorter supply than written training data, resulting in a great need for more speech corpora to serve different domains of Arabic ASR. The available spoken corpora were mainly collected from broadcast news (radio and television) and telephone conversations, which have certain technical and quality shortcomings. To produce a robust speaker-independent continuous automatic Arabic speech recognizer, a set of speech recordings that is rich and balanced is required. It is rich in the sense that it must contain all the phonemes of the Arabic language, and balanced in the sense that it must preserve the phonetic distribution of the Arabic language. This set of speech recordings must be based on a proper written set of sentences and phrases created by experts. Therefore, it is crucial to create a high-quality written (text) set of sentences and phrases before recording them. This work adds a new kind of speech data for Arabic language based text and speech applications, alongside other kinds such as broadcast news and telephone conversations. It is therefore an invitation to all Arabic ASR developers and research groups to explore and capitalize on it.
KW - Arabic language
KW - Automatic speech recognition
KW - Phonetically balanced
KW - Phonetically rich
KW - Speech corpus
UR - https://www.scopus.com/pages/publications/77957778000
U2 - 10.1109/ICCCE.2010.5556832
DO - 10.1109/ICCCE.2010.5556832
M3 - Conference contribution
AN - SCOPUS:77957778000
SN - 9781424462346
T3 - International Conference on Computer and Communication Engineering, ICCCE'10
BT - International Conference on Computer and Communication Engineering, ICCCE'10
Y2 - 11 May 2010 through 12 May 2010
ER -