Title
Spanners: a formal framework for information extraction
Abstract
An intrinsic part of information extraction is the creation and manipulation of relations extracted from text. In this paper, we develop a foundational framework where the central construct is what we call a spanner. A spanner maps an input string into relations over the spans (intervals specified by bounding indices) of the string. The focus of this paper is on the representation of spanners. Conceptually, there are two kinds of such representations. Spanners defined in a primitive representation extract relations directly from the input string; those defined in an algebra apply algebraic operations to the primitively represented spanners. This framework is driven by SystemT, an IBM commercial product for text analysis, where the primitive representation is that of regular expressions with capture variables. We define additional types of primitive spanner representations by means of two kinds of automata that assign spans to variables. We prove that the first kind has the same expressive power as regular expressions with capture variables; the second kind expresses precisely the algebra of the regular spanners---the closure of the first kind under standard relational operators. The core spanners extend the regular ones by string-equality selection (an extension used in SystemT). We give some fundamental results on the expressiveness of regular and core spanners. As an example, we prove that regular spanners are closed under difference (and complement), but core spanners are not. Finally, we establish connections with related notions in the literature.
Year
DOI
Venue
2013
10.1145/2463664.2463665
PODS
Keywords
Field
DocType
primitive representation,regular expression,capture variable,core spanner,primitive spanner representation,spanner map,primitive representation extract relation,information extraction,regular spanner,foundational framework,formal framework,input string,regular expressions,finite state automata
Discrete mathematics,Regular expression,Computer science,Automaton,Theoretical computer science,Finite-state machine,Information extraction,Relational operator,Spanner,Bounding overwatch,Algebraic operation
Conference
Citations 
PageRank 
References 
15
0.64
35
Authors
4
Name
Order
Citations
PageRank
Ronald Fagin188082643.66
Benny Kimelfeld2103471.63
Frederick Reiss3102571.10
Stijn Vansummeren466745.68