Abstract | ||
---|---|---|
In this paper, we investigate how the characteristics of a biomedical dataset affect the classification accuracy of an algorithm. We quantify the complexity of a biomedical dataset using five complexity measures: correlation-based feature selection subset merit, noise, imbalance ratio, missing values and information gain. The effect of these complexity measures on classification accuracy is evaluated using five diverse machine learning algorithms: J48 (decision tree), SMO (support vector machine), Naive Bayes (probabilistic), IBk (instance-based learner) and JRIP (rule-based induction). The results of our experiments show that noise and correlation-based feature selection subset merit, rather than the particular choice of algorithm, play a major role in determining the classification accuracy. Finally, we provide researchers with a meta-model and an empirical equation to estimate the classification potential of a dataset on the basis of its complexity. This will help researchers to efficiently pre-process the dataset for automatic knowledge extraction. |
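
Three of the complexity measures named in the abstract (imbalance ratio, missing values and information gain) can be illustrated with a minimal Python sketch. The definitions below are common textbook forms and are assumptions; the paper's exact formulas are not given in this record and may differ. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def imbalance_ratio(labels):
    """Assumed definition: majority class count divided by minority class count."""
    counts = Counter(labels).values()
    return max(counts) / min(counts)


def missing_fraction(rows):
    """Assumed definition: share of attribute cells that are missing (None)."""
    cells = [value for row in rows for value in row]
    return sum(value is None for value in cells) / len(cells)


def information_gain(feature, labels):
    """Information gain of a discrete feature: H(labels) - H(labels | feature)."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy dataset: four instances, two attributes, one missing cell.
rows = [("high", "yes"), ("high", None), ("low", "no"), ("low", "no")]
labels = ["sick", "sick", "healthy", "sick"]
first_attribute = [row[0] for row in rows]

print("imbalance ratio:", imbalance_ratio(labels))
print("missing fraction:", missing_fraction(rows))
print("information gain:", information_gain(first_attribute, labels))
```

Noise and correlation-based feature selection subset merit are left out of the sketch because they depend on definitions that this record does not state.
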
Year | DOI | Venue |
---|---|---|
2009 | 10.1007/978-3-642-02976-9_51 | AIME '09 |
Keywords | Field | DocType |
automatic knowledge extraction,biomedical dataset,complexity measure,major role,diverse machine,correlation-based feature selection subset,classification accuracy,classification potential,decision tree,naive bayes,machine learning,rule based,support vector machine,feature selection,meta model,information gain,missing values,knowledge extraction | Decision tree,Data mining,Naive Bayes classifier,Feature selection,Computer science,Support vector machine,C4.5 algorithm,Artificial intelligence,Knowledge extraction,Missing data,Probabilistic logic,Machine learning | Conference |
Volume | ISSN | Citations |
5651 | 0302-9743 | 11 |
PageRank | References | Authors |
0.57 | 3 | 2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Ajay Kumar Tanwani | 1 | 66 | 9.07 |
Muddassar Farooq | 2 | 1221 | 83.47 |