Mining the Web for lists of Named Entities - Citegraph

Paper Info

Title
Mining the Web for lists of Named Entities

Abstract
Named entities play an important role in Information Extraction. They represent unitary, nameable pieces of information within text. In this work, we focus on groups of named entities of the same type, which we try to extract from HTML lists. Instead of starting from a class and identifying the corresponding named entities, we explore a new paradigm that consists of identifying sets of named entities without any knowledge of their class. A clear advantage of this approach is that it applies to all named entities, whatever their class, which makes it domain independent. We use HTML lists to collect candidate sets of named entities. To evaluate the approach, human assessors judged a randomly selected sample of HTML lists from the Web: 8.25% of these lists are lists of named entities of the same class. If this estimate holds at large scale, at least 890 million such lists can be expected in the indexed Web alone. Moreover, we propose a classifier for identifying these lists of named entities, which shows promising first results.
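The candidate-collection step described in the abstract (gathering HTML lists as candidate sets of named entities) could be sketched roughly as follows. This is a minimal illustration using only Python's standard `html.parser` module, not the authors' implementation: the names `ListCollector`, `collect_candidate_lists`, and `expected_entity_lists` are hypothetical, and the sketch assumes that a candidate set is simply the item texts of one `<ul>` or `<ol>` element. The classifier that decides whether a candidate really is a list of same-class named entities is not reproduced here.

```python
from html.parser import HTMLParser


class ListCollector(HTMLParser):
    """Collect the text of <li> items, grouped by their enclosing <ul>/<ol>.

    Hypothetical helper for illustration; not taken from the paper.
    """

    def __init__(self):
        super().__init__()
        self.lists = []    # finished candidate lists (each a list of strings)
        self._stack = []   # currently open lists, innermost last
        self._item = None  # text fragments of the <li> being read, if any

    def handle_starttag(self, tag, attrs):
        if tag in ("ul", "ol"):
            # A nested list ends the text of the enclosing item (simplification).
            self._close_item()
            self._stack.append([])
        elif tag == "li" and self._stack:
            self._close_item()  # tolerate a missing </li>
            self._item = []

    def handle_endtag(self, tag):
        if tag == "li":
            self._close_item()
        elif tag in ("ul", "ol") and self._stack:
            self._close_item()
            items = self._stack.pop()
            if items:
                self.lists.append(items)

    def handle_data(self, data):
        if self._item is not None:
            self._item.append(data)

    def _close_item(self):
        # Store the current item in the innermost open list, whitespace-normalized.
        if self._item is not None and self._stack:
            text = " ".join("".join(self._item).split())
            if text:
                self._stack[-1].append(text)
        self._item = None


def collect_candidate_lists(html):
    """Return every non-empty <ul>/<ol> in `html` as a list of item strings."""
    parser = ListCollector()
    parser.feed(html)
    parser.close()
    return parser.lists


def expected_entity_lists(total_lists, fraction=0.0825):
    """Scale the sampled 8.25% rate to `total_lists` HTML lists.

    `total_lists` is an assumed input; the abstract does not state it.
    """
    return total_lists * fraction
```

For example, `collect_candidate_lists("<ul><li>Paris</li><li>Toulouse</li></ul>")` yields one candidate set containing the two item texts. Each candidate would then go to a classifier or a human assessor to decide whether it is a list of named entities of the same class.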

Year	Venue	DocType	Citations	PageRank	References	Authors
2011	CORIA	Conference	1	0.34	10	3

Field
Entity linking, Identifier, Computer science, Information extraction, Library science

Authors (3 rows)

Name	Order	Citations	PageRank
Arlind Kopliku	1	64	9.45
Mohand Boughanem	2	923	109.00
Karen Pinel-Sauvagnat	3	1	0.34