Title
Application of structured document parsing to focused web crawling
Abstract
The performance of a focused, or topic-specific, Web robot can be improved by taking into consideration the structure of the documents downloaded by the robot. In the case of HTML, document structure is tree-like, defined by nested document elements (tags) and their attributes. By analysing this structure, a robot may use the text of certain HTML elements to prioritise documents for downloading and thus significantly improve the speed of convergence to a topic. Clear separation of the structure-aware document parser from the download scheduler provides flexibility but requires a standard interface and protocol between the two. The paper discusses such an interface in the context of an experimental Web robot, whose speed of convergence to a topic was observed to increase by a factor of 3 to 8, as measured by the number of documents downloaded to reach a given average relevance score.
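The idea in the abstract can be illustrated with a minimal Python sketch. It is not the interface or the scoring method described in the paper. The class names (`LinkContextParser`, `DownloadScheduler`), the element weights, and the keyword-count scoring are all assumptions made for illustration. The sketch only shows the general shape: a structure-aware parser extracts the text of selected HTML elements, and a separate scheduler orders the discovered links by a score derived from that text.

```python
import heapq
import itertools
from html.parser import HTMLParser

# Hypothetical weights: how much a topic-term hit in each element's text counts.
ELEMENT_WEIGHTS = {"title": 3.0, "a": 1.0}


class LinkContextParser(HTMLParser):
    """Structure-aware parser: collects the page title and each link's anchor text."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []            # list of (href, anchor_text)
        self._in_title = False
        self._href = None          # href of the <a> element currently open, if any
        self._anchor_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._anchor_parts = []

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        if self._href is not None:
            self._anchor_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._anchor_parts).strip()))
            self._href = None


def score(context, topic_terms):
    """Weighted count of topic terms found in the text of each element (assumed metric)."""
    total = 0.0
    for element, text in context.items():
        hits = sum(1 for word in text.lower().split() if word in topic_terms)
        total += ELEMENT_WEIGHTS.get(element, 0.0) * hits
    return total


class DownloadScheduler:
    """Priority queue of URLs; the highest-scoring URL is downloaded first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()   # tie-breaker keeps insertion order

    def push(self, url, priority):
        heapq.heappush(self._heap, (-priority, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]


def schedule_links(html, topic_terms, scheduler):
    """Glue between parser and scheduler: pass (url, element-text context) across."""
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    for href, anchor in parser.links:
        context = {"title": parser.title, "a": anchor}
        scheduler.push(href, score(context, topic_terms))


if __name__ == "__main__":
    page = ('<html><head><title>Web crawling notes</title></head><body>'
            '<a href="/crawler">focused crawler design</a>'
            '<a href="/cats">cat pictures</a></body></html>')
    queue = DownloadScheduler()
    schedule_links(page, {"crawler", "crawling", "focused"}, queue)
    print(queue.pop())
```

Keeping the parser and the scheduler in separate classes mirrors the separation the abstract argues for: only `schedule_links` knows about both, so either side could be swapped out as long as the (URL, element-text context) hand-off stays the same.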
Year
2011
DOI
10.1016/j.csi.2010.08.002
Venue
Computer Standards & Interfaces
Keywords
structured document, certain html element, topic-specific web robot, standard interface, experimental web robot, clear separation, nested document element, structure-aware document parser, download scheduler, average relevance score, focused web crawling, document structure, structural element, robot, web crawler, web crawling
Field
HTML element, Information structure, Structured document, Web page, Information retrieval, Computer science, Document clustering, Document Structure Description, Parsing, Web crawler
DocType
Journal
Volume
33
Issue
3
ISSN
0920-5489
Citations
8
PageRank
0.54
References
6
Authors
2
Ahmed Patel (order 1, citations 167, PageRank 23.33)
Nikita Schmidt (order 2, citations 335, PageRank 18.25)