Abstract |
---|
Information-extraction (IE) research typically focuses on clean-text inputs. However, an IE engine serving real applications yields many false alarms due to less-well-formed input. For example, IE in a multilingual broadcast processing system has to deal with inaccurate automatic transcription and translation. The resulting presence of non-target-language text in this case, and non-language material interspersed in data from other applications, raise the research problem of making IE robust to such noisy input text. We address one such IE task: entity-mention detection. We describe augmenting a statistical mention-detection system in order to reduce false alarms from spurious passages. The diverse nature of input noise leads us to pursue a multi-faceted approach to robustness. For our English-language system, at various miss rates we eliminate 97% of false alarms on inputs from other Latin-alphabet languages. In another experiment, representing scenarios in which genre-specific training is infeasible, we process real financial-transactions text containing mixed languages and data-set codes. On these data, because we do not train on data like it, we achieve a smaller but significant improvement. These gains come with virtually no loss in accuracy on clean English text. |
Year | Venue | Keywords |
---|---|---|
2010 | EMNLP | ie engine, noisy input text, clean english text, clean-text input, false alarm, english-language system, real financial-transactions text, improving mention detection robustness, non-target-language text, input noise, ie task |
Field | DocType | Volume |
---|---|---|
Broadcasting, Computer science, Robustness (computer science), Speech recognition, Artificial intelligence, Natural language processing, Spurious relationship, Machine learning | Conference | D10-1 |
Citations | PageRank | References |
---|---|---|
12 | 0.80 | 22 |
Authors |
---|
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Radu Florian | 1 | 924 | 91.44 |
John F. Pitrelli | 2 | 493 | 81.16 |
Salim Roukos | 3 | 6248 | 845.50 |
Imed Zitouni | 4 | 612 | 46.39 |