Combining URL and HTML Features for Entity Discovery in the Web. - Citegraph

Paper Info

Title
Combining URL and HTML Features for Entity Discovery in the Web.

Abstract
The web is a large repository of entity-pages. An entity-page is a page that publishes data representing an entity of a particular type, for example, a page that describes a driver on a website about a car racing championship. The attribute values published in entity-pages can be used by many data-driven companies, such as insurers, retailers, and search engines. In this article, we define a novel method, called SSUP, which discovers the entity-pages on websites. The novelty of our method is that it combines URL and HTML features in a way that allows the URL terms to have different weights depending on their capacity to distinguish entity-pages from other pages, which increases the efficacy of the entity-page discovery task. SSUP determines the similarity thresholds on each website without human intervention. We carried out experiments on a dataset with different real-world websites and a wide range of entity types. SSUP achieved 95% precision and 85% recall. Our method was compared with two state-of-the-art methods and outperformed them with a precision gain between 51% and 66%.

Year	DOI	Venue
2019	10.1145/3365574	ACM Transactions on the Web
Keywords	Field	DocType
URL and HTML features,crawler,entity-pages,web structure mining	World Wide Web,Search engine,Information retrieval,Recall rate,Championship,Computer science,Novelty,Car racing,Web crawler	Journal
Volume	Issue	ISSN
13	4	1559-1131
Citations	PageRank	References
0	0.34	0
Authors
3

Authors (3 rows)

Name	Order	Citations	PageRank
Edimar Manica	1	8	2.19
Carina F. Dorneles	2	61	10.35
Renata Galante	3	99	13.96