Title
On knowledge-poor methods for person name matching and lemmatization for highly inflectional languages
Abstract
Web person search is one of the most common activities of Internet users. Recently, a vast amount of work has been reported on applying various NLP techniques to person name disambiguation in large web document collections, with the main focus on English and a few other major languages. This article reports on knowledge-poor methods for tackling person name matching and lemmatization in Polish, a highly inflectional language with a complex person name declension paradigm. These methods mainly apply well-established string distance metrics, some new variants thereof, automatically acquired simple suffix-based lemmatization patterns, and some combinations of these techniques. Furthermore, we carried out some initial experiments on deploying techniques that utilize the context in which person names appear. Results of numerous experiments are presented. The evaluation, carried out on a data set extracted from a corpus of on-line news articles, revealed that achieving lemmatization accuracy greater than 90% seems to be difficult, whereas combining string distance metrics with suffix-based patterns results in 97.6% to 99% accuracy for the name matching task. Interestingly, no significant additional gain could be achieved by integrating some basic techniques that try to exploit the local context in which the names appear. Although our explorations were focused on Polish, we believe that the work presented in this article constitutes practical guidelines for tackling the same problem in other highly inflectional languages with similar phenomena.
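To make the idea of combining a string distance metric with suffix-based patterns more concrete, the following is a minimal sketch only, not the authors' implementation. It assumes plain Levenshtein distance as the metric, and the rule list `SUFFIX_RULES` and the helpers `candidate_lemmas`, `token_distance` and `match_name` are hypothetical names with a few hand-written rules chosen for illustration, whereas the article describes automatically acquired patterns.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


# Hypothetical, hand-written rules (inflected ending -> base ending).
# These are illustrative assumptions, not patterns taken from the article.
SUFFIX_RULES = [("skiego", "ski"), ("skiemu", "ski"), ("owi", ""), ("em", ""), ("a", "")]


def candidate_lemmas(token: str) -> set:
    """Return the token itself plus every form produced by a matching rule."""
    candidates = {token}
    for suffix, replacement in SUFFIX_RULES:
        if token.endswith(suffix) and len(token) > len(suffix):
            candidates.add(token[:-len(suffix)] + replacement)
    return candidates


def token_distance(inflected: str, base: str) -> float:
    """Smallest length-normalized edit distance over all candidate lemmas."""
    return min(
        levenshtein(c.lower(), base.lower()) / max(len(c), len(base))
        for c in candidate_lemmas(inflected)
    )


def match_name(inflected: str, base_forms: list) -> str:
    """Pick the base form whose tokens are closest to the inflected name."""
    def score(base: str) -> float:
        a, b = inflected.split(), base.split()
        if len(a) != len(b):
            return 1.0  # assumption: penalize names with a different token count
        return sum(token_distance(x, y) for x, y in zip(a, b)) / len(a)
    return min(base_forms, key=score)


if __name__ == "__main__":
    bases = ["Jan Kowalski", "Janina Kowalska", "Jan Nowak"]
    print(match_name("Jana Kowalskiego", bases))
```

In this sketch the suffix rules only generate candidate base forms, and the distance metric makes the final choice, so an imperfect rule list degrades gracefully instead of producing a wrong lemma outright.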
Year
2009
DOI
10.1007/s10791-008-9085-5
Venue
Inf. Retr.
Keywords
Person name matching, Highly inflectional languages, Lemmatization, String distance metrics
Field
Declension, Lemmatisation, Web mining, Information retrieval, Suffix, Computer science, Exploit, Natural language processing, Artificial intelligence, String distance, Proper noun, The Internet
DocType
Journal
Volume
12
Issue
3
ISSN
1386-4564
Citations
16
PageRank
0.83
References
22
Authors
3
Name, Order, Citations, PageRank
Jakub Piskorski, 1, 435, 50.04
Karol Wieloch, 2, 26, 2.46
Marcin Sydow, 3, 264, 22.71