Abstract
Arabic script is complex: the same character takes multiple shapes depending on its position in a word. Ligatures pose a further challenge for recognition: a specific sequence of two or more characters takes a shape different from the one those characters normally have in the same positions. Deep learning-based systems are now widely used for text recognition. In this work, we investigate the performance of deep learning systems under two alternative modeling choices: using characters as modeling units and using character shapes as modeling units. We also investigate text recognition with mixed typefaces, where the training and test sets contain samples from multiple typefaces, and discuss the effect of font families on recognition performance. We extend this by studying how effectively the system recognizes text from unseen typefaces, i.e., text in the test set is from a typeface not available during training. Finally, we present a methodology to automatically detect ligatures in printed Arabic text. We conducted experiments on the publicly available APTI dataset of printed Arabic text and report and discuss the results.
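To make the distinction between characters and character shapes concrete, the following is a minimal sketch that uses Unicode's Arabic Presentation Forms as a stand-in for positional shapes. It is an illustration only and is not the label set or method used in the paper; the helper name `describe_form` and the choice of the letter BEH and the LAM-ALEF ligature are assumptions made for this example.

```python
import unicodedata


def describe_form(ch):
    """Return (position_tag, base_characters) for an Arabic presentation form.

    Hypothetical helper for illustration: it reads Unicode's compatibility
    decomposition, e.g. '<initial> 0628' for U+FE91.
    """
    decomp = unicodedata.decomposition(ch)
    if not decomp.startswith("<"):
        # No compatibility tag: treat the character as its own base.
        return ("none", ch)
    tag, *codes = decomp.split()
    base = "".join(chr(int(code, 16)) for code in codes)
    return (tag.strip("<>"), base)


# The four positional shapes of the letter BEH (U+0628).
BEH_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]

# Shape-level modeling units: one label per (position, character) pair.
shape_labels = [describe_form(c) for c in BEH_FORMS]

# Character-level modeling units: all four shapes collapse to one label.
char_labels = {base for _, base in shape_labels}

# A ligature glyph maps back to a sequence of two characters (LAM + ALEF).
lam_alef = describe_form("\uFEFB")

print(len(shape_labels), "shape labels vs", len(char_labels), "character label")
print("ligature tag:", lam_alef[0], "| base length:", len(lam_alef[1]))
```

Under these assumptions, a shape-based recognizer would need a larger label set than a character-based one, and a ligature would count as a single glyph that stands for more than one underlying character.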
| Original language | English |
|---|---|
| Title of host publication | Document Analysis and Recognition – ICDAR 2023 Workshops, Proceedings |
| Editors | Mickael Coustaty, Alicia Fornés |
| Publisher | Springer Science and Business Media Deutschland GmbH |
| Pages | 5-18 |
| Number of pages | 14 |
| ISBN (Print) | 9783031415005 |
| State | Published - 2023 |
| Event | 17th International Conference on Document Analysis and Recognition, ICDAR 2023 - San José, United States; Duration: 21 Aug 2023 → 26 Aug 2023 |
Publication series
| Name | Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) |
|---|---|
| Volume | 14194 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 17th International Conference on Document Analysis and Recognition, ICDAR 2023 |
|---|---|
| Country/Territory | United States |
| City | San José |
| Period | 21/08/23 → 26/08/23 |
Bibliographical note
Publisher Copyright: © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.
Keywords
- Automatic Ligatures Identification
- Omnifont Text Recognition
- Printed Arabic Text OCR
- Unseen Font Text Recognition
ASJC Scopus subject areas
- Theoretical Computer Science
- General Computer Science