Title
Using Tree Automata and Regular Expressions to Manipulate Hierarchically Structured Data
Abstract
Information, stored or transmitted in digital form, is often structured. Individual data records are usually represented as hierarchies of their elements. Together, records form larger structures. Information processing applications have to take account of this structuring, which assigns different semantics to different data elements or records. Big variety of structural schemata in use today often requires much flexibility from applications—for example, to process information coming from different sources. To ensure application interoperability, translators are needed that can convert one structure into another. This paper puts forward a formal data model aimed at supporting hierarchical data pro- cessing in a simple and flexible way. The model is based on and extends results of two classical theories, studying finite string and tree automata. The concept of finite automata and regular languages is applied to the case of arbitrarily structured tree-like hierarchical data records, represented as "structured strings." These automata are compared with classical string and tree automata; the model is shown to be a superset of the classical models. Regular grammars and expressions over structured strings are introduced. Regular expression matching and substitution has been widely used for efficient unstruc- tured text processing; the model described here brings the power of this proven technique to applications that deal with information trees. A simple generic alternative is offered to replace today's specialised ad-hoc approaches. The model unifies structural and content transforma- tions, providing applications with a single data type. An example scenario of how to build applications based on this theory is discussed. Further research directions are outlined. Categories and subject descriptors: E.1: Data Structures—Trees; F.4.3 (Mathematical Logic and Formal Languages): Formal Languages—Classes defined by grammars or au- tomata; I.7.2 (Document and Text Processing): Document Preparation—Markup lan- guages; H.3.3 (Information Storage and Retrieval): Information Search and Retrieval— Query formulation
Year
Venue
Keywords
2002
Clinical Orthopaedics and Related Research
languages additional key words and phrases: data model,general terms: theory,information structure,hierarchy,tree automaton,regular expression,data type,data model,formal language,regular language,information processing,hierarchical data,data structure,finite automata
Field
DocType
Volume
Programming language,Computer science,Theoretical computer science,Data type,Artificial intelligence,Natural language processing,Regular language,Hierarchical database model,Text processing,Regular expression,Information processing,Finite-state machine,Data model,Machine learning
Journal
cs.CL/0201
Citations 
PageRank 
References 
0
0.34
2
Authors
2
Name
Order
Citations
PageRank
Nikita Schmidt133518.25
Ahmed Patel216723.33