Title
Document Understanding System Using Stochastic Context-Free Grammars
Abstract
We present a document understanding system in which the arrangement of lines of text and block separators within a document is modeled by stochastic context-free grammars. A grammar corresponds to a document genre; our system may be adapted to a new genre simply by replacing the input grammar. The system incorporates an optical character recognition system that outputs characters, their positions, and font sizes. These features are combined to form a document representation of lines of text and separators. Lines of text are labeled as tokens using regular expression matching. The maximum likelihood parse of this stream of tokens and separators yields a functional labeling of the document lines. We describe business card and business letter applications.
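The pipeline described in the abstract (regular-expression token labeling followed by a maximum likelihood parse) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the token patterns, the toy business card grammar in Chomsky normal form, its rule probabilities, and the names `label_line`, `viterbi_parse`, and `understand`. Separators are omitted for brevity, and the parser is a standard Viterbi variant of CYK, which may differ from the authors' parser.

```python
import math
import re

# Hypothetical token patterns; the first pattern that matches labels the line.
TOKEN_PATTERNS = [
    ("EMAIL", re.compile(r"\S+@\S+\.\S+")),
    ("PHONE", re.compile(r"\+?[\d\s().-]{7,}")),
    ("TEXT", re.compile(r".+")),
]

# Hypothetical toy grammar in Chomsky normal form.
# BINARY[parent] = [(left child, right child, probability), ...]
BINARY = {
    "CARD": [("NAME", "BODY", 1.0)],
    "BODY": [("ORG", "CONTACTS", 0.7), ("TITLE", "CONTACTS", 0.3)],
    "CONTACTS": [("CONTACT", "CONTACTS", 0.4), ("CONTACT", "CONTACT", 0.6)],
}
# LEXICAL[preterminal] = {token: probability}
LEXICAL = {
    "NAME": {"TEXT": 1.0},
    "ORG": {"TEXT": 1.0},
    "TITLE": {"TEXT": 1.0},
    "CONTACT": {"PHONE": 0.5, "EMAIL": 0.5},
}


def label_line(line):
    """Map one line of text to a token name by regular expression matching."""
    for name, pattern in TOKEN_PATTERNS:
        if pattern.fullmatch(line.strip()):
            return name
    return "TEXT"


def viterbi_parse(tokens, start="CARD"):
    """Return (probability, per-line labels) of the most likely parse, or None."""
    n = len(tokens)
    # best[(i, j)][A] = (log probability, backpointer) for A spanning tokens[i:j]
    best = {}
    for i, tok in enumerate(tokens):
        cell = {}
        for nt, emissions in LEXICAL.items():
            if tok in emissions:
                cell[nt] = (math.log(emissions[tok]), None)
        best[(i, i + 1)] = cell
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = {}
            for k in range(i + 1, j):
                left, right = best[(i, k)], best[(k, j)]
                for parent, rules in BINARY.items():
                    for b, c, p in rules:
                        if b in left and c in right:
                            score = math.log(p) + left[b][0] + right[c][0]
                            if parent not in cell or score > cell[parent][0]:
                                cell[parent] = (score, (k, b, c))
            best[(i, j)] = cell
    if n == 0 or start not in best[(0, n)]:
        return None

    labels = [None] * n

    def walk(i, j, nt):
        # Follow backpointers down to the preterminal that covers each line.
        back = best[(i, j)][nt][1]
        if back is None:
            labels[i] = nt
        else:
            k, b, c = back
            walk(i, k, b)
            walk(k, j, c)

    walk(0, n, start)
    return math.exp(best[(0, n)][start][0]), labels


def understand(lines):
    """Label lines as tokens, then parse the token stream."""
    return viterbi_parse([label_line(line) for line in lines])


if __name__ == "__main__":
    card = ["Jane Doe", "Example Corp", "+1 555 010 0199", "jane@example.com"]
    print(understand(card))
```

In this sketch, adapting to a different genre would mean swapping `BINARY`, `LEXICAL`, and `TOKEN_PATTERNS` while leaving the parser unchanged, which mirrors the abstract's point that only the input grammar is replaced. The toy grammar requires at least two contact lines; a card that does not fit the grammar yields `None`.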
Year
2005
DOI
10.1109/ICDAR.2005.93
Venue
ICDAR-1
Keywords
free grammar,business card,document genre,grammar corresponds,document representation,document understanding system,separators yield,stochastic context-free grammars,document line,optical character recognition system,business letter application,stochastic context free grammar,optical character recognition,context free grammars,stochastic processes,maximum likelihood estimation,data processing,maximum likelihood,feature extraction,information extraction
Field
Point (typography),Regular expression,Context-free grammar,Pattern recognition,Computer science,Document layout analysis,Optical character recognition,Grammar,Information extraction,Natural language processing,Artificial intelligence,Parsing
DocType
Conference
ISSN
1520-5363
ISBN
0-7695-2420-6
Citations
4
PageRank
0.44
References
7
Authors
3
Name, Order, Citations, PageRank
John C. Handley, 1, 44, 13.08
Anoop M. Namboodiri, 2, 255, 26.36
Richard Zanibbi, 3, 452, 38.74