Abstract |
---|
The performance of a focused, or topic-specific, Web robot can be improved by taking into consideration the structure of the documents downloaded by the robot. In the case of HTML, document structure is tree-like, defined by nested document elements (tags) and their attributes. By analysing this structure, a robot may use the text of certain HTML elements to prioritise documents for downloading and thus significantly improve the speed of convergence to a topic. Clear separation of the structure-aware document parser from the download scheduler provides flexibility but requires a standard interface and protocol between the two. The paper discusses such an interface in the context of an experimental Web robot, whose speed of convergence to a topic was observed to increase by a factor of 3 to 8, as measured by the number of documents downloaded to reach a given average relevance score. |
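
The separation of parser and scheduler described in the abstract can be illustrated with a minimal Python sketch. It is not the interface defined in the paper: the class and function names, the choice of elements (`title`, `h1`, `h2`, anchor text), the weights, and the term-counting relevance score are all assumptions made for illustration only.

```python
import heapq
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical weights: which elements count, and by how much, is an assumption.
ELEMENT_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.0}
ANCHOR_WEIGHT = 2.0


def relevance(text, topic_terms):
    """Toy relevance score: number of words in text that are topic terms."""
    return sum(1 for w in re.findall(r"[a-z0-9]+", text.lower()) if w in topic_terms)


class StructureAwareParser(HTMLParser):
    """Collects links with their anchor text, plus text of weighted elements."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.stack = []          # currently open weighted elements
        self.weighted_text = []  # (weight, text) pairs
        self.links = []          # [absolute_url, anchor_text] pairs
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag in ELEMENT_WEIGHTS:
            self.stack.append(tag)
        if tag == "a":
            href = dict(attrs).get("href")
            self._in_link = bool(href)
            if href:
                self.links.append([urljoin(self.base_url, href), ""])

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including this tag, tolerating unclosed children.
            while self.stack.pop() != tag:
                pass
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.stack:
            self.weighted_text.append((ELEMENT_WEIGHTS[self.stack[-1]], text))
        if self._in_link:
            self.links[-1][1] += " " + text


class DownloadScheduler:
    """Priority queue of URLs; knows nothing about HTML."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = 0  # tie-breaker: first offered, first served

    def offer(self, url, priority):
        if url in self._seen:
            return
        self._seen.add(url)
        # heapq is a min-heap, so negate the priority to pop the highest first.
        heapq.heappush(self._heap, (-priority, self._counter, url))
        self._counter += 1

    def next_url(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


def process_document(url, html, topic_terms, scheduler):
    """Parse one page and hand its scored links to the scheduler."""
    parser = StructureAwareParser(url)
    parser.feed(html)
    page_score = sum(w * relevance(t, topic_terms) for w, t in parser.weighted_text)
    for link_url, anchor in parser.links:
        scheduler.offer(link_url, page_score + ANCHOR_WEIGHT * relevance(anchor, topic_terms))
    return page_score
```

In this sketch the only contact point between the two components is `offer(url, priority)`, so either side could be replaced without touching the other. The experimental robot discussed in the paper may divide the work quite differently.
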

Property | Value |
---|---|
Year | 2011 |
DOI | 10.1016/j.csi.2010.08.002 |
Venue | Computer Standards & Interfaces |
Keywords | structured document, certain html element, topic-specific web robot, standard interface, experimental web robot, clear separation, nested document element, structure-aware document parser, download scheduler, average relevance score, focused web crawling, document structure, structural element, robot, web crawler, web crawling |
Field | HTML element, Information structure, Structured document, Web page, Information retrieval, Computer science, Document clustering, Document Structure Description, Parsing, Web crawler |
DocType | Journal |
Volume | 33 |
Issue | 3 |
ISSN | 0920-5489 |
Citations | 8 |
PageRank | 0.54 |
References | 6 |

Authors |
---|
2 |

Name | Order | Citations | PageRank |
---|---|---|---|
Ahmed Patel | 1 | 167 | 23.33 |
Nikita Schmidt | 2 | 335 | 18.25 |