Phonetically rich and balanced Arabic speech corpus: An overview

  • Mohammad A.M. Abushariah
  • Raja N. Ainon
  • Roziati Zainuddin
  • Othman O. Khalifa
  • Moustafa Elshafei

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

13 Scopus citations

Abstract

A lack of spoken and written training data is one of the main obstacles facing Arabic automatic speech recognition (ASR) researchers. Almost no written or spoken corpora are readily available to the public, and many can be obtained only by purchase from the Linguistic Data Consortium (LDC) or the European Language Resources Association (ELRA). Spoken training data is scarcer than written training data, so more speech corpora are needed to serve the different domains of Arabic ASR. The available spoken corpora were collected mainly from broadcast news (radio and television) and telephone conversations, and they have certain technical and quality shortcomings. Producing a robust, speaker-independent, continuous Arabic speech recognizer requires a set of speech recordings that is both rich and balanced: rich in the sense that it contains all the phonemes of the Arabic language, and balanced in the sense that it preserves the phonetic distribution of the Arabic language. These recordings must be based on a proper written set of sentences and phrases created by experts, so it is crucial to create a high-quality written (text) set of sentences and phrases before recording them. This work adds a new kind of speech data for Arabic text- and speech-based applications, alongside existing kinds such as broadcast news and telephone conversations. It is therefore an invitation to all Arabic ASR developers and research groups to explore and capitalize on it.
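
As an illustration only, the two properties named in the abstract can be sketched in a few lines of Python. This is not the authors' procedure: the function names, the four-symbol toy inventory, and the reference frequencies below are all hypothetical, and total variation distance is just one assumed way to measure how far a sentence set departs from a reference phoneme distribution.

```python
from collections import Counter


def phoneme_distribution(sentences):
    """Relative frequency of each phoneme across all sentences."""
    counts = Counter(p for sentence in sentences for p in sentence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {p: c / total for p, c in counts.items()}


def is_rich(sentences, inventory):
    """Rich: every phoneme in the inventory occurs at least once."""
    seen = {p for sentence in sentences for p in sentence}
    return set(inventory) <= seen


def balance_gap(sentences, reference):
    """Total variation distance to the reference distribution (0 = identical)."""
    dist = phoneme_distribution(sentences)
    keys = set(dist) | set(reference)
    return 0.5 * sum(abs(dist.get(k, 0.0) - reference.get(k, 0.0)) for k in keys)


# Toy, hypothetical data: placeholder symbols, not a real Arabic phoneme set.
inventory = ["a", "b", "t", "s"]
reference = {"a": 0.4, "b": 0.2, "t": 0.2, "s": 0.2}
sentences = [["b", "a", "t"], ["s", "a"]]  # each sentence is a phoneme sequence

print(is_rich(sentences, inventory))
print(balance_gap(sentences, reference))
```

Under these assumptions, a candidate sentence set would count as rich when `is_rich` returns True and as better balanced the closer `balance_gap` is to zero; any acceptance threshold would be a design choice the abstract does not specify.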

Original language: English
Title of host publication: International Conference on Computer and Communication Engineering, ICCCE'10
State: Published - 2010
Event: International Conference on Computer and Communication Engineering, ICCCE'10 - Kuala Lumpur, Malaysia
Duration: 11 May 2010 to 12 May 2010

Publication series

Name: International Conference on Computer and Communication Engineering, ICCCE'10

Conference

Conference: International Conference on Computer and Communication Engineering, ICCCE'10
Country/Territory: Malaysia
City: Kuala Lumpur
Period: 11/05/10 to 12/05/10

Keywords

  • Arabic language
  • Automatic speech recognition
  • Phonetically balanced
  • Phonetically rich
  • Speech corpus

ASJC Scopus subject areas

  • Computer Networks and Communications
