BPTI: Bilingual Printed Text Images Dataset for Recognition Purposes

Mohammad Yahia, Husni Al-Muhtaseb

Research output: Contribution to journal › Article › peer-review


Datasets of text images are important for optical text recognition systems, where they can be used to improve performance and recognition rates. In this work, we present a bilingual dataset of Arabic/English text images to address the lack of available bilingual text databases. The dataset consists of 97,812 text images in two groups: scanned page images and digitized line images. Images in both groups are written in 10 fonts and four sizes, and are prepared or scanned at four dpi resolutions. The dataset preparation process includes text collection, text editing, image construction, and image processing. The dataset can be used for optical text recognition, optical font recognition, language identification, and segmentation. Several text recognition and language identification experiments were conducted using images from the dataset and a Hidden Markov Model (HMM) classifier. In the recognition experiments on digitized images, the best recognition correctness is 99.01% and the best accuracy is 99.01%; the font with the highest recognition rates was Tahoma. In the recognition experiments on scanned images, Tahoma also showed the highest performance, with 97.86% correctness and 97.73% accuracy. In the language identification experiments, Tahoma achieved a word-language identification rate of 99.98%.
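
The abstract reports correctness and accuracy as separate figures without defining them. The sketch below assumes the definitions commonly used with HMM toolkits such as HTK, where correctness ignores insertion errors and accuracy penalizes them; the paper itself may define the metrics differently. The function name and the counts are hypothetical and are not results from the dataset.

```python
from typing import Tuple


def recognition_rates(n_ref: int, deletions: int,
                      substitutions: int, insertions: int) -> Tuple[float, float]:
    """Return (correctness, accuracy) in percent.

    Assumed HTK-style definitions (not confirmed by the abstract):
      correctness = (N - D - S) / N * 100
      accuracy    = (N - D - S - I) / N * 100
    where N is the number of reference labels.
    """
    if n_ref <= 0:
        raise ValueError("n_ref must be positive")
    hits = n_ref - deletions - substitutions
    correctness = 100.0 * hits / n_ref
    accuracy = 100.0 * (hits - insertions) / n_ref
    return correctness, accuracy


# Made-up counts for illustration only.
corr, acc = recognition_rates(n_ref=1000, deletions=5,
                              substitutions=15, insertions=10)
print(f"correctness={corr:.2f}%  accuracy={acc:.2f}%")
```

Under these assumed definitions the two figures coincide only when the recognizer makes no insertion errors, which would be consistent with identical correctness and accuracy values and with accuracy falling slightly below correctness otherwise.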

Original language: English
Pages (from-to): 655-668
Number of pages: 14
Journal: International Arab Journal of Information Technology
Issue number: 4
State: Published - Jul 2023

Bibliographical note

Publisher Copyright:
© 2023, Zarka Private University. All rights reserved.


  • HMM
  • Optical character recognition
  • text images dataset

ASJC Scopus subject areas

  • General Computer Science

