Title
Nexir: A Novel Web Extraction Rule Language Toward A Three-Stage Web Data Extraction Model
Abstract
As the most popular information publishing platform, the Web contains a great deal of valuable data of interest to users and applications. Although many data mining and analysis techniques have been studied over the last decade, there are still few easy-to-use web data mining tools that let users extract useful data from the Web. Web information extraction is a complete process involving web page navigation, data extraction, and data integration. Unfortunately, most existing studies and systems do not give sufficient consideration to this three-stage process. Most of them also lack rules powerful enough to express the flexible extraction logic needed to extract data records with complicated structure. In this paper, we propose a novel web data extraction language, NEXIR, built around a three-stage web data extraction model. First, the language can define rules that let the system automate the navigation of web pages, including deep web pages that require user interaction. Second, the language allows users to define flexible and complex rules to extract data records from web pages and integrate the extracted data into a predefined structure. A language engine and a prototype extraction system have been implemented based on the proposed language. The experimental results show that our language and system are effective and powerful compared with existing data extraction approaches.
Year: 2013
DOI: 10.1007/978-3-642-41230-1_3
Venue: WEB INFORMATION SYSTEMS ENGINEERING - WISE 2013, PT I
Keywords: Web data extraction, Extraction Rule language, Data record, Web page navigation, Web data integration
Field: Data integration, Data mining, World Wide Web, Data information, Information retrieval, Web page, Computer science, Web extraction, Information extraction, Deep Web, Data extraction, Data records
DocType: Conference
Volume: 8180
Issue: PART 1
ISSN: 0302-9743
Citations: 2
PageRank: 0.37
References: 28
Authors (7)
Name             Order  Citations  PageRank
Shengsheng Shi   1      9          2.16
Wu Wei           2      3          1.07
Yulong Liu       3      3          0.73
Haitao Wang      4      538        36.95
Lei Luo          5      2          0.37
Chunfeng Yuan    6      418        30.84
Huang, Yihua     7      167        22.07