Title
Arabic document layout analysis.
Abstract
Document layout analysis is a key step in converting document images into text. Arabic script is cursive and written in different styles, which poses challenges for the analysis of Arabic text documents. In this paper, we introduce an approach for Arabic document layout analysis. In this approach, the document is segmented into a set of zones using morphological operations. The segmented zones are classified as text or non-text using a support vector machine classifier. The features used for zone classification combine texture-based features and connected component-based features. The size of the texture-based feature vector is reduced using a genetic algorithm. Classified text zones are clustered into lines using an adaptive sample set clustering algorithm. Each line is then segmented into words by clustering inter-word and intra-word spaces. The proposed system was evaluated against two other systems that represent the best available tools for Arabic document analysis, and the evaluation results show that it works well on multi-font and multi-size documents with a variety of layouts, even on some historical documents.
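The last step of the pipeline, splitting a line into words by clustering the spaces between components, can be illustrated with a small sketch. The abstract does not say which clustering algorithm is used for this step, so the example below substitutes a plain one-dimensional 2-means as a stand-in. The function names (`split_gaps_two_means`, `segment_words`) and the input format (horizontal start and end of each connected component on one line) are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Tuple

Extent = Tuple[int, int]  # hypothetical input: (x_start, x_end) of one component


def split_gaps_two_means(gaps: List[float], iters: int = 50) -> float:
    """Cluster 1-D gap widths into two groups and return the threshold between them.

    Stand-in for the paper's clustering step: a basic 2-means on gap widths.
    """
    if not gaps:
        raise ValueError("need at least one gap")
    lo, hi = float(min(gaps)), float(max(gaps))
    if lo == hi:
        # All gaps are equal, so there is nothing to separate.
        return hi
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        small = [g for g in gaps if g <= mid]  # candidate intra-word gaps
        large = [g for g in gaps if g > mid]   # candidate inter-word gaps
        new_lo = sum(small) / len(small)
        new_hi = sum(large) / len(large)
        if new_lo == lo and new_hi == hi:
            break  # centroids stopped moving
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


def segment_words(extents: List[Extent]) -> List[List[Extent]]:
    """Group the component extents of one text line into words.

    Components are sorted by x position; Arabic reading order is right to
    left, so a caller may want to reverse the returned list.
    """
    if not extents:
        return []
    ordered = sorted(extents)
    # Gap between each component and the next; overlaps count as zero.
    gaps = [max(0, ordered[i + 1][0] - ordered[i][1])
            for i in range(len(ordered) - 1)]
    if not gaps:
        return [ordered]
    threshold = split_gaps_two_means(gaps)
    words = [[ordered[0]]]
    for extent, gap in zip(ordered[1:], gaps):
        if gap > threshold:
            words.append([extent])     # wide gap: start a new word
        else:
            words[-1].append(extent)   # narrow gap: same word
    return words


if __name__ == "__main__":
    line = [(0, 10), (12, 20), (22, 30), (45, 55), (57, 65), (80, 90)]
    for word in segment_words(line):
        print(word)
```

Clustering the gaps, rather than fixing a pixel threshold, lets the split adapt to each line, which matters for the multi-font and multi-size documents the abstract mentions.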
Year: 2017
DOI: 10.1007/s10044-017-0595-x
Venue: Pattern Anal. Appl.
Keywords: Layout analysis, Texture features, Connected component, Clustering, Genetic algorithm, Feature selection
Field: Cursive, Feature vector, Arabic, Feature selection, Pattern recognition, Computer science, Document layout analysis, Artificial intelligence, Natural language processing, Connected component, Cluster analysis, Genetic algorithm
DocType: Journal
Volume: 20
Issue: 4
ISSN: 1433-755X
Citations: 1
PageRank: 0.35
References: 6
Authors: 6
Name                     Order  Citations  PageRank
Amany M. Hesham          1      1          0.35
Mohsen Rashwan           2      31         10.36
Hassanin Al-Barhamtoshy  3      5          3.11
Sherif Abdou             4      86         13.33
Amr Badr                 5      68         17.50
Ibrahim Farag            6      20         7.01