Abstract
Digital libraries worldwide maintain valuable historical manuscripts. Digital copies of these manuscripts are usually offered to researchers and readers as raster images. These images carry several kinds of document degradation that may hinder automatic information retrieval tasks such as manuscript indexing, categorization, and retrieval by content. In this paper, we propose a learning-free, hybrid document layout analysis method for handwritten historical manuscripts. It has two main phases: page characterization and segmentation. First, the proposed method obtains an initial location of the main content using top-down whitespace analysis, employing anisotropic diffusion filtering to find whitespace. Then, it extracts template features representing the writing behavior of the manuscripts' authors. After that, moving windows scan the manuscript page to define the main-content boundaries more precisely. We evaluated the proposed method on two datasets: a publicly available set of 38 historical manuscript pages, and a set of 51 historical manuscript pages collected from the online Harvard Library. Experiments on both datasets show promising results, with main-content segmentation reaching a success rate of up to 98.5%.
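The whitespace-finding step can be illustrated with a minimal sketch. It assumes a Perona-Malik style anisotropic diffusion with an exponential edge-stopping function, followed by a simple per-column ink count. The abstract does not state which diffusion variant or whitespace criterion the method uses, so the function names `perona_malik` and `whitespace_columns` and the parameters `n_iter`, `kappa`, `step`, `ink_threshold`, and `max_ink_ratio` are all hypothetical choices made for illustration.

```python
import math


def perona_malik(img, n_iter=10, kappa=30.0, step=0.2):
    """Smooth a 2D grayscale image (list of rows) while preserving strong edges.

    Assumed variant: Perona-Malik diffusion over the 4-neighbourhood with
    conductance exp(-(gradient / kappa)^2). A step of at most 0.25 keeps the
    explicit update stable for four neighbours.
    """
    h, w = len(img), len(img[0])
    cur = [[float(v) for v in row] for row in img]
    for _ in range(n_iter):
        nxt = [row[:] for row in cur]
        for y in range(h):
            for x in range(w):
                flux = 0.0
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    # Neighbours outside the image contribute no flux.
                    if 0 <= ny < h and 0 <= nx < w:
                        grad = cur[ny][nx] - cur[y][x]
                        # Large gradients (ink edges) give near-zero conductance.
                        flux += math.exp(-((grad / kappa) ** 2)) * grad
                nxt[y][x] = cur[y][x] + step * flux
        cur = nxt
    return cur


def whitespace_columns(img, ink_threshold=128.0, max_ink_ratio=0.05):
    """Return column indices whose fraction of dark pixels is at most max_ink_ratio."""
    h, w = len(img), len(img[0])
    cols = []
    for x in range(w):
        ink = sum(1 for y in range(h) if img[y][x] < ink_threshold)
        if ink / h <= max_ink_ratio:
            cols.append(x)
    return cols


# Synthetic 8x12 "page": white background, an ink block in columns 4-7,
# and one faint stain pixel in the left margin.
page = [[0 if 4 <= x <= 7 else 255 for x in range(12)] for _ in range(8)]
page[2][1] = 235

smoothed = perona_malik(page)
gaps = whitespace_columns(smoothed)
```

In this toy example the faint stain is smoothed toward the background while the sharp ink boundary is left almost untouched, so the margin columns on both sides of the ink block are reported as whitespace. A real page would need thresholds tuned to its degradation level.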
Original language | English |
---|---|
Pages (from-to) | 329-342 |
Number of pages | 14 |
Journal | International Journal on Digital Libraries |
Volume | 21 |
Issue number | 3 |
State | Published - 1 Sep 2020 |
Bibliographical note
Publisher Copyright: © 2020, Springer-Verlag GmbH Germany, part of Springer Nature.
Keywords
- Anisotropic diffusion filtering
- Document analysis
- Document indexing
- Document retrieval
- Geometric feature
- Image segmentation
- Whitespace analysis
ASJC Scopus subject areas
- Library and Information Sciences