Abstract | ||
---|---|---|
In this paper, we introduce the task of selecting a compact lexicon from large, noisy gazetteers. This scenario arises often in practice, in particular in spoken language understanding (SLU). We propose a simple and effective solution based on matrix decomposition techniques: canonical correlation analysis (CCA) and rank-revealing QR (RRQR) factorization. CCA is first used to derive low-dimensional gazetteer embeddings from domain-specific search logs. RRQR is then used to find a subset of these embeddings whose span approximates the entire lexicon space. Experiments on slot tagging show that our method yields a small set of lexicon entities with an average relative error reduction of more than 50% over a randomly selected lexicon. |
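
The two-step pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a toy entry-by-context count matrix in place of real search logs, uses a simplified CCA-style embedding (SVD of a count matrix scaled by row and column totals), and substitutes SciPy's column-pivoted QR for a true rank-revealing QR. The function names `cca_embeddings` and `select_entries` are hypothetical.

```python
import numpy as np
from scipy.linalg import qr


def cca_embeddings(counts, dim):
    """Simplified CCA-style embeddings (an assumption, not the paper's exact recipe).

    counts: (n_entries, n_contexts) nonnegative co-occurrence counts.
    Returns an (n_entries, dim) matrix with one embedding per lexicon entry.
    """
    counts = np.asarray(counts, dtype=float)
    row = counts.sum(axis=1, keepdims=True)
    col = counts.sum(axis=0, keepdims=True)
    # Guard against empty rows or columns so the scaling never divides by zero.
    row[row == 0] = 1.0
    col[col == 0] = 1.0
    # Scale each count by the square roots of its row and column totals.
    scaled = counts / np.sqrt(row) / np.sqrt(col)
    # The top left singular vectors serve as low-dimensional entry embeddings.
    u, _, _ = np.linalg.svd(scaled, full_matrices=False)
    return u[:, :dim]


def select_entries(embeddings, k):
    """Pick k entries whose embeddings approximately span the full set.

    Column-pivoted QR is used here as a practical stand-in for RRQR.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    if k > min(embeddings.shape):
        # Pivots beyond the matrix rank carry no useful ordering.
        raise ValueError("k must not exceed min(n_entries, embedding dim)")
    # Transpose so that each column corresponds to one lexicon entry.
    _, _, pivots = qr(embeddings.T, mode="economic", pivoting=True)
    # The first k pivots index the most linearly independent columns.
    return pivots[:k]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_counts = rng.poisson(2.0, size=(8, 12))  # 8 entries, 12 contexts (toy data)
    emb = cca_embeddings(toy_counts, dim=3)
    print(select_entries(emb, k=3))
```

The selected indices point back into the rows of the count matrix, so a compact lexicon would be the gazetteer entries at those positions. In this sketch `k` is capped by the embedding dimension because a pivoted QR cannot rank columns meaningfully once the rank is exhausted.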
Year | Venue | Field |
---|---|---|
2015 | Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL) and the 7th International Joint Conference on Natural Language Processing (IJCNLP), Vol. 2 | Pattern recognition, Computer science, Canonical correlation, Matrix decomposition, Lexicon, Artificial intelligence, Spectral method, Factorization, Natural language processing, Small set, Spoken language, Approximation error |
DocType | Volume | Citations |
Conference | P15-2 | 5 |
PageRank | References | Authors |
0.43 | 11 | 4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Young-Bum Kim | 1 | 112 | 13.60 |
Karl Stratos | 2 | 328 | 21.07 |
Xiaohu Liu | 3 | 18 | 2.41 |
Ruhi Sarikaya | 4 | 698 | 64.49 |