Page layout analysis has been studied extensively since the 1980s, particularly after computers began to be used as document storage or database units. For efficient document storage and retrieval from a database, a paper document is transformed into its electronic version. Document image analysis algorithms segment a scanned document into different regions such as text, image, or line regions. As a novel contribution to page layout analysis and classification, the algorithm is developed for both RGB and gray-scale scanned documents and does not require a specific document type or scanning technique. In this thesis, a page classification algorithm is proposed that mainly applies the wavelet transform, a Markov random field (MRF), and the Hough transform to segment text, photo, and strong edge/line regions in both color and gray-scale scanned documents. The algorithm is developed to handle both simple and complex page layout structures and contents (for example, a text-only page versus a book cover that includes text, lines, and/or photos). The methodology consists of five modules. In the first module, pre-processing, image enhancement techniques such as image scaling, filtering, color space conversion, or gamma correction are applied to reduce computation time and enhance the scanned document. The classification techniques operate on the input image at one-fourth resolution in the CIE L*a*b* color space. In the second module, text detection, wavelet analysis generates a text-region candidate map, which is then enhanced by applying a Run Length Encoding (RLE) technique for verification purposes. The third module, photo detection, first performs block-wise segmentation based on a basis vector projection technique. Then, an MRF with a maximum a posteriori (MAP) optimization framework is used to generate the photo map.
In the fourth module, the Hough transform is applied to locate lines. Techniques for edge detection, edge linking, and line-segment fitting are also used in this module to detect strong edges. After these three classification maps are obtained, the last module generates a final page layout map: features are extracted to classify the intersection regions, and the maps are merged into one classification map with K-Means clustering. The proposed technique is tested on several hundred images, and its performance is validated using a confusion matrix (CM). The results show that the technique achieves an average classification accuracy of 85% in text, photo, and background regions on a variety of scanned documents such as articles, magazines, business cards, dictionaries, and newsletters. More importantly, it performs independently of the scanning process and of the input document type (RGB or gray-scale), with comparable classification quality.
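The confusion-matrix validation mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's evaluation code: the function names, the toy labels, and the choice to report both overall accuracy and per-class recall are assumptions made here for illustration, since the abstract does not say how the 85% average was computed.

```python
# Illustrative sketch only; names and toy data are hypothetical, not from the thesis.
LABELS = ("text", "photo", "background")  # region classes named in the abstract


def confusion_matrix(truth, predicted, labels=LABELS):
    """Count samples: rows are ground-truth classes, columns are predicted classes."""
    index = {name: i for i, name in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(truth, predicted):
        matrix[index[t]][index[p]] += 1
    return matrix


def overall_accuracy(matrix):
    """Fraction of samples on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total if total else 0.0


def per_class_recall(matrix, labels=LABELS):
    """For each ground-truth class, the fraction of its samples labeled correctly."""
    return {
        name: (row[i] / sum(row) if sum(row) else 0.0)
        for i, (name, row) in enumerate(zip(labels, matrix))
    }


# Toy per-pixel labels (assumed data, eight samples).
truth = ["text", "text", "text", "photo", "photo",
         "background", "background", "background"]
pred = ["text", "text", "photo", "photo", "photo",
        "background", "text", "background"]

cm = confusion_matrix(truth, pred)
accuracy = overall_accuracy(cm)
recall = per_class_recall(cm)
print(cm)
print(accuracy)
print(recall)
```

In this toy case one text sample is mislabeled as photo and one background sample as text, so six of the eight samples fall on the diagonal of the matrix.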
Library of Congress Subject Headings
Layout (Printing)--Data processing; Classification--Data processing; Image processing--Digital techniques; Optical pattern recognition
Department, Program, or Center
Microelectronic Engineering (KGCOE)
Erkilinc, Mustafa, "Page layout analysis and classification in complex scanned documents" (2011). Thesis. Rochester Institute of Technology. Accessed from
RIT – Main Campus