Learning segmentation of documents with complex scripts - Citegraph

Paper Info

Title
Learning segmentation of documents with complex scripts

Abstract
Most of the state-of-the-art segmentation algorithms are designed to handle complex document layouts and backgrounds, while assuming a simple script structure such as in Roman script. They perform poorly when used with Indian languages, where the components are not strictly collinear. In this paper, we propose a document segmentation algorithm that can handle the complexity of Indian scripts in large document image collections. Segmentation is posed as a graph cut problem that incorporates the a priori information from script structure in the objective function of the cut. We show that this information can be learned automatically and be adapted within a collection of documents (a book) and across collections to achieve accurate segmentation. We show the results on Indian language documents in Telugu script. The approach is also applicable to other languages with complex scripts such as Bangla, Kannada, Malayalam, and Urdu.

Year: 2006
DOI: 10.1007/11949619_67
Venue: ICVGIP
Keywords: roman script, indian script, state-of-the-art segmentation algorithm, simple script structure, accurate segmentation, script structure, complex script, telugu script, document segmentation algorithm, indian language, graph cut
Field: Cut, Computer science, Artificial intelligence, Natural language processing, Information structure, Pattern recognition, Malayalam, Segmentation, Document processing, Speech recognition, Bengali, Latin script, Scripting language
DocType: Conference
Volume: 4338
ISSN: 0302-9743
ISBN: 3-540-68301-1
Citations: 9
PageRank: 0.58
References: 12
Authors: 3

Authors (3 rows)

Name	Order	Citations	PageRank
S. Kumar	1	36	4.04
Anoop M. Namboodiri	2	255	26.36
C. V. Jawahar	3	1700	148.58
