Missing values: how many can they be to preserve classification reliability? - Citegraph

Paper Info

Title
Missing values: how many can they be to preserve classification reliability?

Abstract
Using five medical datasets, we examined the influence of missing values on true positive rates and classification accuracy. We randomly marked an increasing share of values as missing and tested the effect on classification accuracy. Classification was performed with nearest neighbour searching when none, 10, 20, 30% or more of the values were missing. We also used discriminant analysis and the naïve Bayesian method for classification. We found that for a two-class dataset, results almost as good as with no missing values could still be produced even with 20-30% of the values missing. If there are more than two classes, over 10-20% missing values are probably too many, at least for small classes with relatively few cases. The more classes there are, and the more they differ in size, the more sensitive a classification task is to missing values. On the other hand, when values are missing not fully at random but according to actual distributions shaped by some selection or other non-random cause, classification can tolerate even high numbers of missing values for some datasets.
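
The procedure outlined in the abstract (mark a growing share of values as missing at random, then classify with nearest neighbour searching) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the synthetic two-class data, the helper names `mask_at_random`, `distance` and `loo_accuracy`, the leave-one-out evaluation, and the choice of a Euclidean distance computed over the attributes observed in both cases and rescaled to the full attribute count.

```python
import math
import random


def mask_at_random(rows, fraction, rng):
    """Return a copy of rows with about `fraction` of all cells set to None."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    n_missing = round(fraction * len(cells))
    masked = [list(r) for r in rows]
    for i, j in rng.sample(cells, n_missing):
        masked[i][j] = None
    return masked


def distance(a, b):
    """Euclidean distance over attributes observed in both rows, rescaled
    to the full attribute count; infinite if no attribute is shared.
    (Assumed measure for this sketch, not necessarily the paper's.)"""
    shared = [(x - y) ** 2 for x, y in zip(a, b)
              if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum(shared) * len(a) / len(shared))


def loo_accuracy(rows, labels):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min((j for j in range(len(rows)) if j != i),
                      key=lambda j: distance(row, rows[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


rng = random.Random(0)
# Synthetic two-class data (hypothetical, not one of the paper's datasets).
rows = [[rng.gauss(c * 3.0, 1.0) for _ in range(4)]
        for c in (0, 1) for _ in range(30)]
labels = [c for c in (0, 1) for _ in range(30)]

for fraction in (0.0, 0.1, 0.2, 0.3):
    acc = loo_accuracy(mask_at_random(rows, fraction, rng), labels)
    print(f"{fraction:.0%} missing: accuracy {acc:.2f}")
```

Skipping unobserved attributes in the distance is only one way to handle missing values; imputing them before classification is another, and the two can give different results on the same data.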

Field	Value
Year	2013
DOI	10.1007/s10462-011-9282-2
Venue	Artif. Intell. Rev.
Keywords	Medical data, Missing values, Distance measures, Imputation, Classification, Nearest neighbour searching
Field of study	Data mining, Nearest neighbour, Pattern recognition, Naive Bayes classifier, Computer science, Artificial intelligence, Linear discriminant analysis, Missing data, Imputation (statistics), Distance measures
DocType	Journal
Volume	40
Issue	3
ISSN	0269-2821
Citations	3
PageRank	0.37
References	9
Authors	2

Authors (2 rows)

Name	Order	Citations	PageRank
Martti Juhola	1	456	63.94
Jorma Laurikkala	2	345	24.82
