Abstract | ||
---|---|---|
Estimating the degree of semantic similarity/distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval and data mining. In the past, many similarity measures have been proposed, exploiting either explicit knowledge (such as the structure of a taxonomy) or implicit knowledge (such as information distribution). In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, the frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, an excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics of a purely statistical analysis of occurrences and/or the ambiguity of estimating the statistical distribution of concepts from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities in order to compute accurate concept appearance probabilities. This limits the applicability of those measures to other ontologies (such as domain-specific ontologies) and to massive corpora (such as the Web). In this paper, several of the presented issues are analyzed. Modifications of classical similarity measures are also proposed. They are based on a contextualized and scalable version of IC computation in the Web that exploits taxonomical knowledge. The goal is to avoid the measures' dependency on corpus pre-processing in order to achieve reliable results and minimize language ambiguity. Our proposals are able to outperform classical approaches when using the Web for estimating concept probabilities. |
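
The abstract gives no formulas, so the following is only a minimal sketch of the classical IC idea it refers to, not the contextualized Web-based computation the paper proposes. It assumes the standard definition IC(c) = -log p(c), with p(c) estimated from occurrence counts, and a Resnik-style similarity that takes the IC of the most informative common ancestor. The taxonomy, the counts, and all function names are invented for illustration.

```python
import math

# Hypothetical toy taxonomy (child -> parent) and made-up occurrence counts.
# In a Web setting these counts would stand in for search-engine hit counts.
PARENT = {"dog": "mammal", "cat": "mammal", "mammal": "animal", "animal": None}
HITS = {"dog": 250, "cat": 250, "mammal": 500, "animal": 1000}
TOTAL = 1000  # assumed size of the corpus, in the same units as HITS


def information_content(hits: int, total: int) -> float:
    """Return IC = -log2(p), with p estimated as hits / total."""
    if hits <= 0 or total <= 0 or hits > total:
        raise ValueError("need 0 < hits <= total")
    return -math.log2(hits / total)


def ancestors(concept: str) -> list[str]:
    """Return the concept followed by all of its ancestors up to the root."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def resnik_similarity(a: str, b: str) -> float:
    """Return the IC of the most informative concept subsuming both a and b."""
    shared = set(ancestors(a)) & set(ancestors(b))
    return max(information_content(HITS[c], TOTAL) for c in shared)


if __name__ == "__main__":
    print(resnik_similarity("dog", "cat"))
    print(resnik_similarity("dog", "animal"))
```

In this toy setup the counts of a parent include those of its children, which is what a pre-tagged corpus would guarantee. Raw term counts from the Web do not guarantee it, and that is one of the estimation problems the abstract points to.
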
Year | DOI | Venue |
---|---|---|
2010 | 10.1007/s10844-009-0103-x | J. Intell. Inf. Syst. |
Keywords | Field | DocType |
Semantic similarity,Ontologies,Information content,Web,Knowledge discovery | Data mining,Ontology,Computer science,Artificial intelligence,Natural language processing,Web application,Ambiguity,Semantic similarity,Ontology (information science),Information retrieval,Knowledge extraction,Knowledge acquisition,Machine learning,Semantics | Journal |
Volume | Issue | ISSN |
35 | 3 | 0925-9902 |
Citations | PageRank | References |
48 | 1.57 | 26 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
David Sánchez | 1 | 690 | 33.01 |
Montserrat Batet | 2 | 899 | 37.20 |
Aida Valls | 3 | 561 | 20.52 |
Karina Gibert | 4 | 281 | 34.01 |