Title
Web page title extraction and its application
Abstract
This paper is concerned with the automatic extraction of titles from the bodies of HTML documents (web pages). Titles of HTML documents should be correctly defined in the title fields by the authors; in reality, however, they are often bogus. It is therefore advantageous to extract titles from HTML documents automatically. In this paper, we take a supervised machine learning approach to the problem. We first propose a specification of HTML titles, that is, a 'definition' of what counts as an HTML title. Next, we employ two learning methods to perform the task. In one method, we utilize features extracted from the DOM (document object model) tree; in the other, we utilize vision-based features. We also combine the two methods to further enhance extraction accuracy. Our title extraction methods significantly outperform the baseline method of using the lines in the largest font size as the title (22.6-37.4% improvements in terms of F1 score). As an application, we consider web page retrieval, using the TREC Web Track data for evaluation. We propose a new method for HTML document retrieval using extracted titles. Experimental results indicate that using both extracted titles and title fields is almost always better than using title fields alone; the use of extracted titles is particularly helpful in the task of named page finding (25.1-30.3% improvements).
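The baseline mentioned in the abstract (taking the lines in the largest font size as the title) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `largest_font_title` and the input format, a list of `(text, font_size)` pairs in document order, are assumptions made for illustration, and obtaining rendered font sizes from real HTML is outside the scope of the sketch.

```python
from typing import List, Tuple


def largest_font_title(units: List[Tuple[str, float]]) -> str:
    """Hypothetical baseline: join, in document order, every text unit
    whose font size equals the largest font size on the page.

    `units` is an assumed input format: (text, font_size) pairs that
    some earlier rendering step has already produced.
    """
    if not units:
        return ""
    # Largest font size seen anywhere on the page.
    max_size = max(size for _, size in units)
    # Keep only the units rendered at that size, preserving their order.
    parts = [text.strip() for text, size in units if size == max_size]
    return " ".join(parts)


if __name__ == "__main__":
    page = [
        ("Home | Products", 12.0),
        ("Web page title", 24.0),
        ("extraction", 24.0),
        ("Body text of the page.", 10.0),
    ]
    print(largest_font_title(page))
```

A baseline of this kind relies on a single visual cue, which is consistent with the abstract's report that the learned methods, using DOM-tree and vision-based features, improve on it.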
Year: 2007
DOI: 10.1016/j.ipm.2006.11.007
Venue: Inf. Process. Manage.
Keywords: information retrieval, html documents retrieval, metadata extraction, html title, web page title extraction, title field, extraction accuracy, automatic extraction, html document, new method, learning method, title extraction method, object model, web pages, feature extraction, document retrieval
Field: Metadata, F1 score, Point (typography), Web page, Information retrieval, Computer science, Object model, Information extraction, HTML
DocType: Journal
Volume: 43
Issue: 5
ISSN: Information Processing and Management
Citations: 21
PageRank: 0.87
References: 30
Authors (8)
Name            Order   Citations   PageRank
Yewei Xue       1       25          1.27
Yunhua Hu       2       211         11.01
Guomao Xin      3       107         6.85
Ruihua Song     4       1138        59.33
Shuming Shi     5       620         58.27
Yunbo Cao       6       1082        63.12
Chin-Yew Lin    7       3170        242.72
Hang Li         8       6294        317.05