Part of speech tagging approach to designing compound words for Arabic continuous speech recognition systems

Dia AbuZeina*, Moustafa Elshafei, Wasfi Al-Khatib

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

1 Scopus citation

Abstract

Misrecognition of small words is one of the factors that lead to suboptimal performance in automatic continuous speech recognition systems. In general, small words generate far more errors than long words. Compounding some words (small or long) into longer words is therefore welcomed by speech recognition decoders. In this paper, we present a novel approach to artificially generating compound words using part of speech tagging. For this purpose, we consider two Arabic pronunciation cases in which the words are usually spoken together without any intervening silence: a noun followed by an adjective, and a preposition followed by any other word. To collect the candidate compound words, we use the Stanford Arabic tagger to tag all words in our baseline transcription corpus. Using Sphinx 3, we test the proposed method on a 5.4-hour speech corpus of Modern Standard Arabic. The results show a significant improvement: the word error rate is reduced by 2.39%.
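
The candidate collection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the transcription corpus has already been tagged and is available as lists of (word, tag) pairs, and the tag sets, the function names `candidate_compounds` and `merge_compounds`, the `min_count` threshold, and the underscore joiner are all hypothetical choices made for illustration. The example words are transliterated placeholders.

```python
from collections import Counter

# Assumed tag inventory (hypothetical); adjust to the tags the tagger actually emits.
NOUN_TAGS = {"NN", "NNS", "NNP", "DTNN", "DTNNS"}
ADJ_TAGS = {"JJ", "DTJJ"}
PREP_TAGS = {"IN"}


def candidate_compounds(tagged_sentences, min_count=1):
    """Count adjacent word pairs that match one of the two cases:
    noun followed by adjective, or preposition followed by any word.
    min_count is a hypothetical frequency threshold, not taken from the paper."""
    counts = Counter()
    for sentence in tagged_sentences:
        # Walk over every pair of neighbouring (word, tag) tokens.
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            noun_adjective = t1 in NOUN_TAGS and t2 in ADJ_TAGS
            preposition_any = t1 in PREP_TAGS
            if noun_adjective or preposition_any:
                counts[(w1, w2)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


def merge_compounds(words, pairs, joiner="_"):
    """Greedily rewrite a word sequence left to right, joining each
    selected pair into a single compound token."""
    merged = []
    i = 0
    while i < len(words):
        if i + 1 < len(words) and (words[i], words[i + 1]) in pairs:
            merged.append(words[i] + joiner + words[i + 1])
            i += 2  # both words were consumed by the compound
        else:
            merged.append(words[i])
            i += 1
    return merged


# Toy tagged sentence: preposition, definite noun, definite adjective.
tagged = [[("fy", "IN"), ("Albyt", "DTNN"), ("Alkbyr", "DTJJ")]]
pairs = candidate_compounds(tagged)
rewritten = merge_compounds(["fy", "Albyt", "Alkbyr"], pairs)
```

In a setup like the one the abstract describes, the rewritten transcripts would presumably be used to rebuild the language model, and each compound would need its own entry in the pronunciation dictionary; those steps are outside the scope of this sketch.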

Original language: English
Title of host publication: Informatics Engineering and Information Science - International Conference, ICIEIS 2011, Proceedings
Pages: 330-338
Number of pages: 9
Edition: PART 4
State: Published - 2011

Publication series

Name: Communications in Computer and Information Science
Number: PART 4
Volume: 254 CCIS
ISSN (Print): 1865-0929

Keywords

  • Modern Standard Arabic
  • compound words
  • language model
  • part of speech tagging
  • speech recognition

ASJC Scopus subject areas

  • General Computer Science
  • General Mathematics
