Abstract |
---|
The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in entity-pages can be used by many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, which increases the efficacy of the entity-page discovery task. SSUP determines the similarity thresholds on each website without human intervention. We carried out experiments on a dataset with different real-world websites and a wide range of entity types. SSUP achieved 95% precision and 85% recall. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.
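
The abstract says that URL terms receive different weights depending on how well they distinguish entity-pages from other pages. The sketch below illustrates that general idea only. It is not SSUP's formulation, which the abstract does not give: the weighting rule (difference in term frequency between two small labeled URL samples), the function names, and the `example.com` URLs are all assumptions made for illustration, and the HTML features and automatic threshold selection are left out.

```python
import re
from collections import Counter


def url_terms(url):
    """Return the set of lowercase alphabetic terms in a URL."""
    return set(re.findall(r"[a-z]+", url.lower()))


def term_weights(entity_urls, other_urls):
    """Weight each term by how much more often it occurs in entity-page URLs.

    Illustrative rule (an assumption, not SSUP's formula): the share of
    entity-page URLs containing the term minus the share of other URLs
    containing it, clipped at zero.
    """
    ent = Counter(t for u in entity_urls for t in url_terms(u))
    oth = Counter(t for u in other_urls for t in url_terms(u))
    weights = {}
    for term in set(ent) | set(oth):
        p_ent = ent[term] / len(entity_urls)
        p_oth = oth[term] / len(other_urls)
        weights[term] = max(0.0, p_ent - p_oth)
    return weights


def url_score(url, weights):
    """Average weight of the URL's terms; unseen terms count as zero."""
    terms = url_terms(url)
    if not terms:
        return 0.0
    return sum(weights.get(t, 0.0) for t in terms) / len(terms)


# Hypothetical labeled samples from a car racing website.
entity_urls = [
    "http://example.com/drivers/lewis-hamilton",
    "http://example.com/drivers/max-verstappen",
]
other_urls = [
    "http://example.com/news/race-report",
    "http://example.com/teams",
]

weights = term_weights(entity_urls, other_urls)
driver_score = url_score("http://example.com/drivers/fernando-alonso", weights)
news_score = url_score("http://example.com/news/latest", weights)
```

In this toy sample, terms shared by every URL (such as the host name) get zero weight, while `drivers` gets the highest weight, so an unseen driver URL scores above an unseen news URL.
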
Year | DOI | Venue |
---|---|---|
2019 | 10.1145/3365574 | ACM Transactions on the Web |
Keywords | Field | DocType |
URL and HTML features, crawler, entity-pages, web structure mining | World Wide Web, Search engine, Information retrieval, Recall rate, Championship, Computer science, Novelty, Car racing, Web crawler | Journal |
Volume | Issue | ISSN |
13 | 4 | 1559-1131 |
Citations | PageRank | References |
0 | 0.34 | 0 |
Authors |
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
Edimar Manica | 1 | 8 | 2.19 |
Carina F. Dorneles | 2 | 61 | 10.35 |
Renata Galante | 3 | 99 | 13.96 |