Title
Constructing X-of-n Attributes With A Genetic Algorithm
Abstract
The predictive accuracy obtained by a classification algorithm depends strongly on the quality of the attributes of the data being mined. When the attributes have little relevance for predicting the class of a record, the predictive accuracy will tend to be low. To combat this problem, a natural approach is to construct new attributes out of the original attributes. Many attribute construction algorithms work by simply constructing conjunctions and/or disjunctions of attribute-value pairs. This kind of representation has limited expressive power for representing attribute interactions. A more expressive representation is X-of-N (Zheng 1995). An X-of-N condition consists of a set of N attribute-value pairs. The value of an X-of-N condition for a given example (record) is the number of attribute-value pairs of the example that match the N attribute-value pairs of the condition. For instance, consider the following X-of-N condition: X-of-{"Sex = male", "Age < 21", "Salary = high"}. Suppose that a given example has the following attribute-value pairs: {"Sex = male", "Age = 51", "Salary = high"}. This example has 2 out of the 3 attribute-value pairs of the X-of-N condition, so the value of the X-of-N condition for this example is 2.
In our GA an individual represents an X-of-N attribute, i.e. the set of N attribute-value pairs composing that attribute. Each attribute-value pair is of the form Ai = Vij, where Ai is the i-th attribute and Vij is the j-th value belonging to the domain of Ai. The current version of our GA can cope only with categorical attributes. (Continuous attributes are discretized in a preprocessing step.) The value of N is an integer varying from 2 to 7. The fitness function is the information gain ratio of the constructed attribute.
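The X-of-N value and the gain-ratio fitness described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the representation of a condition as a list of (attribute, value) tuples, and the discretized Age labels "<21" and ">=21" are hypothetical, and the gain ratio follows the textbook definition (information gain divided by split information) without any further adjustments an actual C4.5 implementation might apply.

```python
from collections import Counter
from math import log2


def x_of_n_value(condition, example):
    """Count how many attribute-value pairs of the condition the example matches."""
    return sum(1 for attr, val in condition if example.get(attr) == val)


def entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    total = len(items)
    return -sum((c / total) * log2(c / total) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain ratio of an attribute (values) with respect to the class (labels)."""
    total = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Expected class entropy after partitioning the examples by attribute value.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    # An attribute with a single value carries no split information.
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical encoding of the worked example; Age is assumed to be
# discretized into the bins "<21" and ">=21" in a preprocessing step.
condition = [("Sex", "male"), ("Age", "<21"), ("Salary", "high")]
example = {"Sex": "male", "Age": ">=21", "Salary": "high"}

print(x_of_n_value(condition, example))  # 2, as in the worked example above
```

To score a candidate individual in this sketch, one would compute `x_of_n_value` for every training example and pass the resulting list, together with the class labels, to `gain_ratio`.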
In order to evaluate how good the new attributes constructed by the GA are, we have compared the performance of the C4.5 algorithm using only the original attributes with the performance of C4.5 using both the original attributes and the new attributes constructed by the GA. Hereafter we refer to the former and to the latter as the original data set and the extended data set, respectively. The performance of C4.5 on both the original data set and the extended data set was measured with respect to the classification error rate. The experiments were done using public-domain data sets available from http://www.ics.uci.edu/~mlearn/MLRepository.html. The results are shown in Table 1. The results for the first four data sets of Table 1 were produced by a 10-fold cross-validation procedure. The results for the last three data sets (the monks data sets) were obtained by using the predefined partition of the data into training and test sets. The second and third columns of Table 1 show the error rate obtained by C4.5 on the original data set and the extended data set (with the new X-of-N attributes), respectively. The numbers after the symbol "±" denote standard deviations. For each data set, the difference between the error rates of the second and third columns is deemed significant when the two error rate intervals (taking into account the standard deviations) do not overlap. When the error rate of the "original + X-of-N" attributes is significantly better (worse) than the error rate of the "original" attributes, there is a "(+)" ("(-)") sign in the third column. Note that the X-of-N attribute constructed by the GA significantly improved the performance of C4.5 in three data sets (tic-tac-toe, promoters and monks-2), and it significantly degraded the performance of C4.5 in just one data set (monks-3). In the other three data sets the difference in the error rates was not significant.
Table 1: Error rate obtained by C4.5 in seven data sets
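The significance criterion described above (non-overlapping error rate intervals) can be sketched as follows. This is an illustrative sketch under assumptions: each interval is taken to be the error rate plus or minus one standard deviation, intervals that merely touch are treated as overlapping, the function names are hypothetical, and the numbers in the usage lines are made up for illustration and are not taken from Table 1.

```python
def intervals_overlap(err_a, sd_a, err_b, sd_b):
    """True if [err_a - sd_a, err_a + sd_a] and [err_b - sd_b, err_b + sd_b] overlap."""
    return (err_a - sd_a) <= (err_b + sd_b) and (err_b - sd_b) <= (err_a + sd_a)


def significance_mark(err_original, sd_original, err_extended, sd_extended):
    """Return "(+)" or "(-)" for a significant change in error rate, else ""."""
    if intervals_overlap(err_original, sd_original, err_extended, sd_extended):
        return ""  # difference not deemed significant
    # A lower error rate on the extended data set is an improvement.
    return "(+)" if err_extended < err_original else "(-)"


# Hypothetical error rates (in percent), not values from Table 1.
print(significance_mark(20.0, 2.0, 10.0, 1.5))  # (+)
print(significance_mark(10.0, 2.0, 11.0, 2.0))  # prints an empty line
```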
Year: 2002
Venue: GECCO Late Breaking Papers
Keywords: constructing x-of-n attributes, constructive induction, genetic algorithm, x-of-n attributes, data mining, fitness function, cross validation, expressive power, standard deviation, error rate, public domain, computer programming, information gain
DocType: Conference
ISBN: 1-55860-878-8
Citations: 9
PageRank: 0.61
References: 9
Authors: 3
Name                 Order  Citations  PageRank
Otavio Larsen        1      9          0.61
Alex Alves Freitas   2      1886       96.25
Júlio C. Nievola     3      75         8.28