Title
Combining URL and HTML Features for Entity Discovery in the Web.
Abstract
The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in entity-pages can be used by many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, which increases the efficacy of the entity-page discovery task. SSUP determines the similarity thresholds for each website without human intervention. We carried out experiments on a dataset with different real-world websites and a wide range of entity types. SSUP achieved 95% precision and 85% recall. We compared our method with two state-of-the-art methods, and it outperformed them with a precision gain between 51% and 66%.
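To make the idea of weighting URL terms by their capacity to distinguish entity-pages more concrete, here is a minimal Python sketch. It is not the SSUP algorithm: the abstract does not give the weighting formula, so the difference-of-frequencies weight used here is an assumption, as are the function names (`url_terms`, `term_weights`, `url_score`) and the `example.com` URLs. It also covers only the URL side and ignores the HTML features that SSUP combines with it.

```python
import re
from collections import Counter


def url_terms(url):
    """Return the set of lowercase alphabetic terms in a URL (digits and punctuation are dropped)."""
    return set(re.findall(r"[a-z]+", url.lower()))


def term_weights(entity_urls, other_urls):
    """Assumed weighting: how much more often a term occurs in entity-page URLs than in other URLs."""
    entity_counts = Counter(t for u in entity_urls for t in url_terms(u))
    other_counts = Counter(t for u in other_urls for t in url_terms(u))
    weights = {}
    for term in set(entity_counts) | set(other_counts):
        p_entity = entity_counts[term] / len(entity_urls)
        p_other = other_counts[term] / len(other_urls)
        # Terms that are no more frequent in entity-pages get weight 0.
        weights[term] = max(0.0, p_entity - p_other)
    return weights


def url_score(url, weights):
    """Sum the weights of the terms found in the URL; unseen terms contribute nothing."""
    return sum(weights.get(t, 0.0) for t in url_terms(url))


# Hypothetical labelled sample from a car racing website.
entity_urls = [
    "http://example.com/drivers/lewis-hamilton",
    "http://example.com/drivers/max-verstappen",
]
other_urls = [
    "http://example.com/news/2019",
    "http://example.com/about",
]

weights = term_weights(entity_urls, other_urls)
candidate_score = url_score("http://example.com/drivers/sebastian-vettel", weights)
news_score = url_score("http://example.com/news/latest", weights)
```

In this toy sample, terms shared by every URL (such as `com`) receive zero weight, while `drivers`, which appears only in the entity-page URLs, dominates the score. A real system would still need a per-website threshold on the score, which SSUP is reported to determine without human intervention.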
Year: 2019
DOI: 10.1145/3365574
Venue: ACM Transactions on the Web
Keywords: URL and HTML features, crawler, entity-pages, web structure mining
Field: World Wide Web, Search engine, Information retrieval, Recall rate, Championship, Computer science, Novelty, Car racing, Web crawler
DocType: Journal
Volume: 13
Issue: 4
ISSN: 1559-1131
Citations: 0
PageRank: 0.34
References: 0
Authors: 3
Name | Order | Citations | PageRank
Edimar Manica | 1 | 8 | 2.19
Carina F. Dorneles | 2 | 61 | 10.35
Renata Galante | 3 | 99 | 13.96