Title
Extracting unstructured data from template generated web documents
Abstract
We propose a novel approach that identifies web page templates and extracts the unstructured data. Extracting only the body of the page and eliminating the template increases the retrieval precision for queries that generate irrelevant results. We believe that reducing the number of irrelevant results encourages users to return to a given site to search. Our experimental results on several different web sites and on the whole cnnfn collection demonstrate the feasibility of our approach.
Year
2003
DOI
10.1145/956863.956961
Venue
CIKM
Keywords
whole cnnfn collection, web page template, novel approach, different web site, unstructured data, retrieval precision, web document, irrelevant result, web pages, information retrieval
Field
Data mining, Site map, Web page, Information retrieval, Computer science, Website Parse Template, Unstructured data, Information extraction, Template
DocType
Conference
ISBN
1-58113-723-0
Citations
17
PageRank
1.02
References
8
Authors
4
Name              Order  Citations  PageRank
Ling Ma           1      50         5.36
Nazli Goharian    2      460        49.93
Abdur Chowdhury   3      2013       160.59
Misun Chung       4      17         1.02