Title
ARCADE: A prediction method for nominal variables
Abstract
The main problem considered in this paper is the binarization of categorical (nominal) attributes having a very large number of values (20^4 in our application). A small number of relevant binary attributes is extracted from each initial attribute. Suppose that we want to binarize a categorical attribute v with L values, where L is large or very large. The total number of binary attributes that can be extracted from v is 2^(L−1) − 1, which for large L is prohibitive. Our idea is to select only those binary attributes that are predictive; these constitute a small fraction of all possible binary attributes. To do so, the key idea is to group the L values of the categorical attribute by means of a hierarchical clustering method. This requires defining a similarity between values that reflects their predictive power. By clustering the L values into a small number J of clusters, we define a new categorical attribute with only J values. The hierarchical clustering method we use, AVL, allows us to choose a significant value for J. We could now consider using all the 2^(J−1) − 1 binary attributes associated with this new categorical attribute. However, since a hierarchical clustering method was used, the J values are tree-structured; we take advantage of this and consider only about 2 × J binary attributes. If L is extremely large, for complexity and statistical reasons we may not be able to apply a clustering algorithm directly. In this case, we start by "factorizing" v into a pair (v1, v2), each attribute having about √L values. As a simple example, consider an attribute v with only four values m1, m2, m3, m4. In this example there is of course no need to factorize the set of values of v, since it has very few values.
Nevertheless, for illustration purposes, v could be decomposed (factorized) into two attributes with only two values each; the correspondence between the values of v and (v1, v2) would be:

v   (v1, v2)
m1  (1, 1)
m2  (1, 2)
m3  (2, 1)
m4  (2, 2)

We then apply the clustering method to both sets of values of v1 and v2, thereby defining a new synthetic pair (v̄1, v̄2). Next, we "multiply" these new attributes to obtain another attribute v12 with J1 × J2 values, where J1 (resp. J2) is the number of values of v̄1 (resp. v̄2). Finally, we apply a last clustering to the values of v12 and proceed as above. The solution we propose is independent of the number of classes and can be applied in various situations. The application of ARCADE to the protein secondary structure prediction problem demonstrates the validity of our approach.
Keywords
Decision trees; Binarization; Complexity reduction; Categorical attributes; Hierarchical clustering
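The counting and factorization steps described in the abstract can be sketched in code. The following Python fragment is a minimal illustration, not the authors' implementation: the function names (n_binary_splits, factorize) are ours, and the AVL/CHAVL likelihood-linkage clustering itself is not reproduced here.

```python
import math

def n_binary_splits(L):
    """Number of distinct binary attributes (two-block partitions of the
    value set) derivable from a categorical attribute with L values."""
    return 2 ** (L - 1) - 1

def factorize(values):
    """Factorize one attribute with L values into a pair of attributes
    with about sqrt(L) values each, via a row/column encoding.
    Each original value m_i is mapped to a pair (v1, v2)."""
    k = math.ceil(math.sqrt(len(values)))
    return {m: (i // k + 1, i % k + 1) for i, m in enumerate(values)}

# For L = 4 the full search space is tiny, but it explodes with L:
print(n_binary_splits(4))   # -> 7 candidate splits
print(n_binary_splits(20))  # -> 524287 (over half a million)

# The four-value example from the abstract:
print(factorize(["m1", "m2", "m3", "m4"]))
# -> {'m1': (1, 1), 'm2': (1, 2), 'm3': (2, 1), 'm4': (2, 2)}
```

After clustering the values into J tree-structured groups, each internal node of the hierarchy yields one "cluster versus rest" binary attribute, which is why only about 2 × J candidate splits remain instead of 2^(L−1) − 1.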
Year: 1998
DOI: 10.1016/S1088-467X(98)00031-6
Venue: Intell. Data Anal.
Keywords: decision tree, tree structure, profitability, decision trees, hierarchical clustering, complexity reduction
Field: Small number, Hierarchical clustering, Pattern recognition, Categorical variable, Reduction (complexity), Large numbers, Factorization, Artificial intelligence, Cluster analysis, Machine learning, Mathematics, Binary number
DocType: Journal
Journal: Intelligent Data Analysis
Volume: 2
Issue: 1
ISSN:
Citations: 0
PageRank: 0.34
References: 6
Authors: 2
Name | Order | Citations | PageRank
Joaquim Pinto Da Costa | 1 | 262 | 14.82
Israël-César Lerman | 2 | 7 | 4.32