Abstract
Humans encounter text in natural scenes every day, on traffic signs, billboards, and walls. From a computer vision perspective, two main learning paradigms, text detection and text recognition, are most commonly explored for localizing and predicting text in natural scenes. However, many traditional computer vision algorithms for text recognition in natural scenes suffer from poor prediction accuracy due to variations in font style, color, blurriness, and text distortion. To address these challenges, this paper proposes a text recognition architecture that fuses multimodal contexts (a vision model and a language model), trained on a multi-language video subtitle dataset, to recognize text (English letters and Arabic numerals) in video frames (scene images). The vision model is a convolutional recurrent neural network (CRNN) coupled with a connectionist temporal classification (CTC) decoder for feature extraction and text prediction. The language model is a sequence-to-sequence model built on a bidirectional gated recurrent unit (BiGRU) that learns text sequence representations and produces readable text output. The resulting fused modality, Fusion-based CRNN with Sequence-to-Sequence (Fusion-CRNN+Seq2Seq), is used to recognize text from images. The proposed method outperforms all compared approaches, achieving the lowest character error rates of 1.36 and 1.22 under different BiGRU network configurations.
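The abstract does not give the exact layer configuration, so the following PyTorch sketch is only a minimal, hypothetical reading of the pipeline it describes: a CRNN that emits per-timestep CTC logits, and a BiGRU-encoder/GRU-decoder language model that re-reads the CTC hypothesis and produces corrected text. The sizes (`VOCAB_SIZE`, `HIDDEN`), the greedy CTC decode, and the way the two models are chained are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a Fusion-CRNN+Seq2Seq-style pipeline, assuming
# PyTorch, a ~40-symbol vocabulary (English letters, Arabic numerals,
# CTC blank), and a simple greedy decode. All sizes are hypothetical.
import torch
import torch.nn as nn

VOCAB_SIZE = 40   # assumed: letters + digits + CTC blank
HIDDEN = 256      # assumed hidden size


class CRNN(nn.Module):
    """Vision model: CNN feature extractor + recurrent layer + CTC logits."""

    def __init__(self, vocab_size=VOCAB_SIZE, hidden=HIDDEN):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
        )
        # Two 2x2 poolings reduce a 32-pixel-high crop to 8 feature rows.
        self.rnn = nn.GRU(128 * 8, hidden, bidirectional=True,
                          batch_first=True)
        self.fc = nn.Linear(2 * hidden, vocab_size)

    def forward(self, images):                    # (B, 1, 32, W)
        feats = self.cnn(images)                  # (B, 128, 8, W/4)
        b, c, h, w = feats.shape
        # Treat each horizontal position as one timestep of the sequence.
        seq = feats.permute(0, 3, 1, 2).reshape(b, w, c * h)
        out, _ = self.rnn(seq)
        return self.fc(out)                       # (B, W/4, vocab) CTC logits


class Seq2SeqLM(nn.Module):
    """Language model: BiGRU encoder + GRU decoder over character tokens."""

    def __init__(self, vocab_size=VOCAB_SIZE, hidden=HIDDEN):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, bidirectional=True,
                              batch_first=True)
        self.decoder = nn.GRU(hidden, 2 * hidden, batch_first=True)
        self.out = nn.Linear(2 * hidden, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        # Encode the noisy CTC hypothesis; the concatenated final
        # forward/backward states initialise the decoder.
        _, h = self.encoder(self.embed(src_tokens))        # (2, B, hidden)
        h0 = torch.cat([h[0], h[1]], dim=-1).unsqueeze(0)  # (1, B, 2*hidden)
        dec, _ = self.decoder(self.embed(tgt_tokens), h0)
        return self.out(dec)                               # (B, T, vocab)


# Chaining the two models at inference time (greedy decode assumed):
crnn, lm = CRNN(), Seq2SeqLM()
images = torch.randn(2, 1, 32, 128)          # dummy grayscale text crops
hypothesis = crnn(images).argmax(dim=-1)     # naive greedy CTC path
refined = lm(hypothesis, hypothesis)         # LM re-scores the hypothesis
```

In a real system the CRNN would be trained with `nn.CTCLoss`, the hypothesis would be collapsed (blanks and repeats removed) before entering the language model, and the decoder would run autoregressively; those steps are omitted here to keep the sketch short.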
| Original language | English |
| --- | --- |
| Pages (from-to) | 83-91 |
| Number of pages | 9 |
| Journal | ICIC Express Letters, Part B: Applications |
| Volume | 16 |
| Issue number | 1 |
| DOIs | |
| State | Published - Jan 2025 |
Bibliographical note
Publisher Copyright: © 2025 ISSN.
Keywords
- Character error rate
- Connectionist temporal classification
- Convolutional recurrent neural networks
- Language model
- Sequence-to-sequence
- Text recognition
- Vision model
ASJC Scopus subject areas
- General Computer Science