Title
Evaluating and reducing the effect of data corruption when applying bag of words approaches to medical records.
Abstract
Unlike journal corpora, which are supposed to be carefully reviewed before publication, documents in a patient record are often corrupted by misspelled words and by conventional graphies or abbreviations. After a survey of the domain, the paper evaluates the effect of such corruption on an information retrieval (IR) engine. The IR system uses a classical bag-of-words approach, with stems as representation items and term frequency–inverse document frequency (tf–idf) as the weighting scheme; we pay special attention to the normalization factor. First results show that even low corruption levels (3%) affect retrieval effectiveness (by 4–7%), whereas higher corruption levels can affect it by 25%. We then show that an improved automatic spelling correction system, applied to the corrupted collection, can almost restore the retrieval effectiveness of the engine.
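The retrieval setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' engine: it assumes whitespace tokenization in place of stemming, a smoothed logarithmic idf, and cosine (L2) normalization as the normalization factor. The function names and the `corrupt` helper, which swaps adjacent characters to imitate misspellings, are hypothetical and exist only for illustration.

```python
import math
import random
from collections import Counter


def tokenize(text):
    # Assumption: lowercased whitespace tokens stand in for the stems
    # used in the paper.
    return text.lower().split()


def normalize(vec):
    # Cosine (L2) normalization: scale the vector to unit length.
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else vec


def tfidf_vectors(docs):
    """Return one normalized tf-idf vector per document, plus the idf table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    # Assumed idf variant: log(N / df) + 1, so no term gets zero weight.
    idf = {t: math.log(n_docs / d) + 1.0 for t, d in df.items()}
    vectors = [
        normalize({t: tf * idf[t] for t, tf in Counter(toks).items()})
        for toks in tokenized
    ]
    return vectors, idf


def query_vector(query, idf):
    # Query terms unknown to the collection are dropped.
    counts = Counter(t for t in tokenize(query) if t in idf)
    return normalize({t: tf * idf[t] for t, tf in counts.items()})


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def corrupt(text, rate, rng):
    """Swap two adjacent characters in roughly `rate` of the longer words."""
    words = []
    for w in text.split():
        if len(w) > 3 and rng.random() < rate:
            i = rng.randrange(len(w) - 1)
            w = w[:i] + w[i + 1] + w[i] + w[i + 2:]
        words.append(w)
    return " ".join(words)


if __name__ == "__main__":
    docs = [
        "patient reports chest pain and fever",
        "fracture of the left wrist",
        "no chest abnormality on xray",
    ]
    rng = random.Random(0)
    noisy = [corrupt(d, 0.3, rng) for d in docs]
    for label, collection in (("clean", docs), ("corrupted", noisy)):
        vectors, idf = tfidf_vectors(collection)
        q = query_vector("chest pain", idf)
        print(label, [round(cosine(q, v), 3) for v in vectors])
```

In this sketch a misspelled token simply becomes a different index term, so a corrupted document no longer matches the query term it originally contained. That is the kind of loss in retrieval effectiveness the abstract reports, and the reason a spelling correction pass before indexing could recover it.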
Year: 2002
DOI: 10.1016/S1386-5056(02)00057-6
Venue: International Journal of Medical Informatics
Keywords: Corruption, Information retrieval, Medical records, Spelling correction, Natural language processing
Field: Bag-of-words model, Weighting, Normalization (statistics), Information retrieval, Computer science, Medical record, Artificial intelligence, Data Corruption, Natural language processing, Spelling, Schema (psychology), Corruption
DocType: Journal
Volume: 67
Issue: 1
ISSN: 1386-5056
Citations: 17
PageRank: 1.09
References: 16
Authors: 3
Name
Order
Citations
PageRank
P Ruch165038.72
R Baud214113.41
Antoine Geissbuhler381549.75