Typefaces and Ligatures in Printed Arabic Text: A Deep Learning-Based OCR Perspective

Omar Alhubaiti, Irfan Ahmad*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding; Conference contribution; peer-review

Abstract

Arabic script is complex: the same character takes different shapes depending on its position. Ligatures pose a further challenge for recognition: a specific sequence of two or more characters takes a shape different from what those characters normally look like when they appear in a similar position. Deep learning-based systems are now widely used for text recognition. In this work, we investigate the performance of deep learning systems under two alternative modeling choices: using characters as modeling units and using character shapes as modeling units. We also investigate text recognition with mixed typefaces, where the training and test sets contain samples from multiple typefaces, and discuss the effect of font families on recognition performance. We extend this by studying how effectively the text recognition system recognizes text from unseen typefaces, i.e., text in the test set is from a typeface not available in training. Finally, we present a methodology to automatically detect ligatures in printed Arabic text. We conducted experiments on the publicly available APTI dataset of printed Arabic text, report the findings, and discuss the results.
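
The difference between the two modeling choices can be illustrated with a minimal sketch. It is not the authors' labeling scheme: it assumes, purely for illustration, that Unicode Arabic Presentation Forms-B code points stand in for "character shape" labels, and that NFKC normalization maps each shape back to its underlying character. The names BEH_SHAPES, LAM_ALEF_LIGATURE, and to_character_labels are hypothetical.

```python
import unicodedata

# Assumption for illustration: Unicode presentation forms act as
# shape-level labels. These are the four positional forms of BEH (U+0628).
BEH_SHAPES = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}

# The isolated LAM-ALEF ligature: one glyph for a two-character sequence.
LAM_ALEF_LIGATURE = "\uFEFB"


def to_character_labels(shape_labels):
    """Collapse shape-level labels to character-level labels via NFKC."""
    return [unicodedata.normalize("NFKC", s) for s in shape_labels]


# Shape-level modeling keeps four distinct units for BEH ...
shape_units = list(BEH_SHAPES.values())
# ... while character-level modeling merges them into a single unit.
char_units = to_character_labels(shape_units)

# A ligature is one shape that expands to a sequence of characters.
ligature_chars = unicodedata.normalize("NFKC", LAM_ALEF_LIGATURE)

print(len(set(shape_units)), "shape units vs", len(set(char_units)), "character unit")
print("ligature expands to", [f"U+{ord(c):04X}" for c in ligature_chars])
```

Under this assumption, a shape-level recognizer has a larger label set with visually consistent classes, whereas a character-level recognizer has fewer labels but must learn that several distinct glyphs, including ligatures, map to the same characters.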

Original language: English
Title of host publication: Document Analysis and Recognition – ICDAR 2023 Workshops, Proceedings
Editors: Mickael Coustaty, Alicia Fornés
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 5-18
Number of pages: 14
ISBN (Print): 9783031415005
DOIs
State: Published - 2023
Event: 17th International Conference on Document Analysis and Recognition, ICDAR 2023 - San José, United States
Duration: 21 Aug 2023 to 26 Aug 2023

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14194 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 17th International Conference on Document Analysis and Recognition, ICDAR 2023
Country/Territory: United States
City: San José
Period: 21/08/23 to 26/08/23

Bibliographical note

Publisher Copyright:
© 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

Keywords

  • Automatic Ligatures Identification
  • Omnifont Text Recognition
  • Printed Arabic Text OCR
  • Unseen Font Text Recognition

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
