Title | ||
---|---|---|
Evaluating and reducing the effect of data corruption when applying bag of words approaches to medical records. |
Abstract | ||
---|---|---|
Unlike journal corpora, which are supposed to be carefully reviewed before publication, the quality of documents in a patient record is often degraded by misspelled words and conventional graphies or abbreviations. After a survey of the domain, the paper focuses on evaluating the effect of such corruption on an information retrieval (IR) engine. The IR system uses a classical bag of words approach, with stems as representation items and term frequency–inverse document frequency (tf–idf) as the weighting scheme; we pay special attention to the normalization factor. Initial results show that even low corruption levels (3%) affect retrieval effectiveness (4–7%), whereas higher corruption levels can affect retrieval effectiveness by 25%. We then show that an improved automatic spelling correction system, applied to the corrupted collection, can almost restore the retrieval effectiveness of the engine. |
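
The setup described in the abstract can be illustrated with a minimal Python sketch of a bag of words ranker using tf–idf weights and cosine normalization, together with a simple way to inject corruption into a collection. This is not the authors' system: the whitespace tokenizer (no stemming), the smoothed idf formula, the character-substitution corruption model, and all function names (`tfidf_vectors`, `query_vector`, `corrupt`, `rank`) are assumptions made for illustration only.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokens. The paper indexes stems,
    # which this sketch does not attempt to reproduce.
    return text.lower().split()


def normalize(vec):
    # Cosine (L2) normalization, so document length does not dominate scores.
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs):
    """Return (vectors, idf): one normalized tf-idf dict per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumption: log(N/df) + 1, one common idf variant among several.
    idf = {t: math.log(n / df[t]) + 1.0 for t in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append(normalize({t: tf[t] * idf[t] for t in tf}))
    return vectors, idf


def query_vector(query, idf):
    # Query terms unseen in the collection are dropped.
    tf = Counter(t for t in tokenize(query) if t in idf)
    return normalize({t: tf[t] * idf[t] for t in tf})


def cosine(a, b):
    # Both vectors are already unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def corrupt(text, rate, rng):
    """Hypothetical corruption model: each letter is replaced by a random
    lowercase letter with probability `rate`. The replacement may equal the
    original letter, so the effective rate is slightly lower."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    return "".join(
        rng.choice(letters) if c.isalpha() and rng.random() < rate else c
        for c in text
    )


def rank(query, docs):
    """Return document indices sorted by decreasing cosine similarity."""
    vectors, idf = tfidf_vectors(docs)
    q = query_vector(query, idf)
    scores = [cosine(q, v) for v in vectors]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
```

To mimic the kind of experiment the abstract reports, one could rank a clean collection with `rank(query, docs)`, rank a corrupted copy built with `[corrupt(d, rate, random.Random(0)) for d in docs]`, and compare the two rankings at several values of `rate`. A corrupted token no longer matches the query term, which is how spelling noise lowers retrieval effectiveness in a bag of words model.
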
Year | DOI | Venue |
---|---|---|
2002 | 10.1016/S1386-5056(02)00057-6 | International Journal of Medical Informatics |
Keywords | Field | DocType |
Corruption,Information retrieval,Medical records,Spelling correction,Natural language processing | Bag-of-words model,Weighting,Normalization (statistics),Information retrieval,Computer science,Medical record,Artificial intelligence,Data Corruption,Natural language processing,Spelling,Schema (psychology),Corruption | Journal |
Volume | Issue | ISSN |
67 | 1 | 1386-5056 |
Citations | PageRank | References |
17 | 1.09 | 16 |
Authors | ||
3 |
Name | Order | Citations | PageRank |
---|---|---|---|
P Ruch | 1 | 650 | 38.72 |
R Baud | 2 | 141 | 13.41 |
Antoine Geissbuhler | 3 | 815 | 49.75 |