Application of structured document parsing to focused web crawling - Citegraph

Paper Info

Title
Application of structured document parsing to focused web crawling

Abstract
The performance of a focused, or topic-specific, Web robot can be improved by taking into consideration the structure of the documents downloaded by the robot. In the case of HTML, document structure is tree-like, defined by nested document elements (tags) and their attributes. By analysing this structure, a robot may use the text of certain HTML elements to prioritise documents for downloading and thus significantly improve the speed of convergence to a topic. Clear separation of the structure-aware document parser from the download scheduler provides flexibility but requires a standard interface and protocol between the two. The paper discusses such an interface in the context of an experimental Web robot, whose speed of convergence to a topic was observed to increase by a factor of 3 to 8, as measured by the number of documents downloaded to reach a given average relevance score.
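
The idea of scoring links by the text of specific HTML elements and passing the scores to a separate scheduler can be illustrated with a minimal Python sketch. It is not the interface described in the paper, which the abstract does not specify. The names `LinkTextParser`, `Scheduler`, `score` and `parse_and_schedule` are hypothetical, the relevance measure is a toy term-overlap count, and only the anchor text of `<a>` elements is used as the structural signal.

```python
import heapq
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkTextParser(HTMLParser):
    """Collects (absolute URL, anchor text) pairs from <a href=...> elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []      # finished (url, text) pairs
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, " ".join(self._text).strip()))
            self._href = None


def score(text, topic_terms):
    """Toy relevance (an assumption): fraction of topic terms found in the text."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)


class Scheduler:
    """Download queue that returns the highest-scoring unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def offer(self, url, priority):
        if url not in self._seen:
            self._seen.add(url)
            # heapq is a min-heap, so the priority is negated.
            heapq.heappush(self._heap, (-priority, url))

    def next_url(self):
        return heapq.heappop(self._heap)[1]


def parse_and_schedule(html, base_url, topic_terms, scheduler):
    """Parser side of the boundary: the scheduler only sees (url, score) pairs."""
    parser = LinkTextParser(base_url)
    parser.feed(html)
    parser.close()
    for url, text in parser.links:
        scheduler.offer(url, score(text, topic_terms))
```

In this sketch the scheduler never inspects HTML. It receives only URLs and numeric priorities, so the parser could be replaced by one that weights other elements (titles, headings) without changing the scheduling code.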

Year	DOI	Venue
2011	10.1016/j.csi.2010.08.002	Computer Standards & Interfaces
Keywords	Field	DocType
structured document,certain html element,topic-specific web robot,standard interface,experimental web robot,clear separation,nested document element,structure-aware document parser,download scheduler,average relevance score,focused web crawling,document structure,structural element,robot,web crawler,web crawling	HTML element,Information structure,Structured document,Web page,Information retrieval,Computer science,Document clustering,Document Structure Description,Parsing,Web crawler	Journal
Volume	Issue	ISSN
33	3	0920-5489
Citations	PageRank	References
8	0.54	6
Authors
2

Authors (2 rows)

Name	Order	Citations	PageRank
Ahmed Patel	1	167	23.33
Nikita Schmidt	2	335	18.25
