On the influence of training data quality on text document classification using machine learning methods. - Citegraph

Paper Info

Title
On the influence of training data quality on text document classification using machine learning methods.

Abstract
The main aim of this paper was to study the influence of training data quality on the text document classification performance of machine learning methods. A graded relevance corpus of ten classes and 957 text documents was classified with self-organising maps (SOMs), learning vector quantisation, k-nearest neighbours searching, naïve Bayes and support vector machines. The relevance level of a document (irrelevant, marginally, fairly or highly relevant) was used as a measure of the quality of the document as a training example, which is a new approach. The classifiers were evaluated with micro- and macro-averaged classification accuracies. The results suggest that training data of higher quality should be preferred, but even low-quality data can improve a classifier if there is plenty of it. In addition, further means to facilitate classification by the SOMs were explored. The novel set of SOM approach performed clearly better than the original SOM and comparably against supervised classification methods.

Year	2015
DOI	10.1504/IJKEDM.2015.071284
Venue	IJKEDM
Field	Data mining, One-class classification, Computer science, Text document classification, Artificial intelligence, Classifier (linguistics), Training set, Pattern recognition, Naive Bayes classifier, Support vector machine, Learning vector quantization, Linear classifier, Machine learning
DocType	Journal
Volume	3
Issue	2
Citations	1
PageRank	0.35
References	20
Authors	5

Authors (5 rows)

Name	Order	Citations	PageRank
Jyri Saarikoski	1	16	3.21
Henry Joutsijoki	2	46	8.41
Kalervo Järvelin	3	4749	358.13
Jorma Laurikkala	4	345	24.82
Martti Juhola	5	456	63.94
