Abstract | ||
---|---|---|
We present two methods for automatic indexing, both based on an interlingual layer of content description. In the first approach, we acquire indexing patterns from English documents by statistically relating interlingual representations of English documents (based on text token bigrams) to their associated index terms. Given such indexing patterns, we then induce the associated index terms when the same interlingual representations turn up in documents in other natural languages (viz. German and Portuguese). Hence, we 'learn' from past English indexing experience and transfer it in an unsupervised way to non-English languages, without ever having seen any concrete indexing data for languages other than English. In the second approach, documents from the three languages are heuristically matched with a sophisticated medical thesaurus (the English MeSH) after both the documents and the thesaurus have been transformed into the interlingua. The combination of the statistical and heuristic methods in a fully automated indexing system achieves 56% to 68% of human indexing performance for each of the three languages. |
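
The first approach can be illustrated with a minimal sketch. The abstract does not specify the statistical model, so the scoring below (the relative frequency of an index term given a bigram, summed over a document's bigrams) is an assumption chosen for simplicity, not the authors' method. The function names, the `#`-prefixed interlingual identifiers, and the toy documents are all hypothetical; the step that maps English, German, or Portuguese text into the interlingua is assumed to have happened already.

```python
from collections import Counter, defaultdict


def bigrams(tokens):
    """Return adjacent token pairs of an interlingual token sequence."""
    return list(zip(tokens, tokens[1:]))


def learn_patterns(training_docs):
    """Estimate P(index term | bigram) from (tokens, index_terms) pairs.

    Assumption: a plain relative-frequency estimate over documents.
    """
    bigram_counts = Counter()
    cooccurrence = defaultdict(Counter)
    for tokens, terms in training_docs:
        for bg in set(bigrams(tokens)):  # count each bigram once per document
            bigram_counts[bg] += 1
            for term in set(terms):
                cooccurrence[bg][term] += 1
    return {
        bg: {term: n / bigram_counts[bg] for term, n in term_counts.items()}
        for bg, term_counts in cooccurrence.items()
    }


def suggest_terms(patterns, tokens, top_k=3):
    """Rank index terms for a document in any language, once it is in the interlingua."""
    scores = Counter()
    for bg in set(bigrams(tokens)):
        for term, p in patterns.get(bg, {}).items():
            scores[term] += p
    # Sort by descending score, breaking ties alphabetically for stable output.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:top_k]]


# Hypothetical English training data: interlingual tokens plus manually assigned terms.
english_training = [
    (["#heart", "#inflamm", "#therapy"], ["Myocarditis"]),
    (["#heart", "#inflamm", "#diagnos"], ["Myocarditis"]),
    (["#kidney", "#inflamm", "#therapy"], ["Nephritis"]),
]
patterns = learn_patterns(english_training)

# Hypothetical German document, already mapped to the same interlingual tokens.
german_doc = ["#diagnos", "#heart", "#inflamm"]
suggested = suggest_terms(patterns, german_doc)
print(suggested)
```

No German or Portuguese indexing data enters training here: the transfer relies entirely on documents from all languages sharing one interlingual vocabulary, which is the property the abstract describes.
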
Year | Venue | Keywords |
---|---|---|
2004 | RIAO | indexation,natural language,indexing terms,english language |
Field | DocType | Citations |
Information retrieval,Computer science,Interlingua,Portuguese,Search engine indexing,Natural language,Natural language processing,Bigram,Artificial intelligence,Automatic indexing,Security token,German | Conference | 7 |
PageRank | References | Authors |
0.62 | 16 | 5 |
Name | Order | Citations | PageRank |
---|---|---|---|
Kornél Markó | 1 | 103 | 10.17 |
Udo Hahn | 2 | 32 | 4.80 |
Stefan Schulz | 3 | 1092 | 127.03 |
Philipp Daumke | 4 | 34 | 7.34 |
Percy Nohama | 5 | 56 | 13.12 |