Language determination: natural language processing from scanned document images

Abstract
Many documents are available to a computer only as images from paper. However, most natural language processing systems expect their input as character-coded text, which may be difficult or expensive to extract accurately from the page. We describe a method for converting a document image into character shape codes and word shape tokens. We believe that this representation, which is both cheap and robust, is sufficient for many NLP tasks. In this paper, we show that the representation is sufficient for determining which of 23 languages the document is written in, using only a small number of features, with greater than 90% accuracy overall.
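The representation the abstract describes can be illustrated with a minimal sketch. It works on already-coded text rather than on a page image, so it only shows what a word shape token might look like, not how one is extracted from a scan. The class letters (`A`, `x`, `g`, `i`, `j`) and the character groupings are assumptions made for illustration and may differ from the mapping defined in the paper; the function names are hypothetical.

```python
from collections import Counter

# Assumed groupings, for illustration only (not taken from the paper):
# tall glyphs, glyphs that drop below the baseline, and dotted glyphs.
ASCENDERS = set("bdfhklt") | set("ABCDEFGHIJKLMNOPQRSTUVWXYZ") | set("0123456789")
DESCENDERS = set("gpqy")


def shape_code(ch):
    """Map one character to a coarse shape class."""
    if ch == "i":
        return "i"          # x-height glyph with a dot
    if ch == "j":
        return "j"          # dotted glyph with a descender
    if ch in ASCENDERS:
        return "A"          # rises above the x-height
    if ch in DESCENDERS:
        return "g"          # drops below the baseline
    if ch.isalpha():
        return "x"          # everything else sits within the x-height
    return ch               # punctuation is passed through unchanged


def word_shape_token(word):
    """Concatenate the shape codes of a word's characters."""
    return "".join(shape_code(c) for c in word)


def token_counts(text):
    """Count word shape tokens in whitespace-separated text."""
    return Counter(word_shape_token(w) for w in text.split())


if __name__ == "__main__":
    print(word_shape_token("the"))       # AAx
    print(word_shape_token("language"))  # Axxgxxgx
    print(token_counts("the cat and the dog"))
```

Because many distinct words collapse to the same token (here "cat" and "and" both become `xxA`), the representation is lossy. The abstract's claim is that the frequencies of such tokens still carry enough signal to tell languages apart.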

Year: 1994
DOI: 10.3115/974358.974363
Venue: ANLP
Keywords: character-coded text, character shape code, language determination, nlp task, small number, natural language processing system, accuracy overall, document image, scanned document image, word shape token, natural language processing
Field: Cache language model, Question answering, Computer science, Natural language programming, Information extraction, Natural language processing, Universal Networking Language, Language identification, Artificial intelligence, Low-level programming language, Language primitive
DocType: Conference
Citations: 31
PageRank: 11.31
References: 3
Authors

Name	Order	Citations	PageRank
Penelope Sibun	1	284	187.65
A. Lawrence Spitz	2	234	37.48
