Title | ||
---|---|---|
Stability selection using a genetic algorithm and logistic linear regression on healthcare records. |
Abstract | ||
---|---|---|
This paper presents an application of a Genetic Algorithm (GA) to measuring feature importance in machine learning (ML) on a large-scale database. Too many input features may cause over-fitting, so feature selection is desirable. Some ML algorithms have feature selection embedded, e.g., lasso-penalized linear regression or random forests. Others do not include such functionality and are sensitive to over-fitting, e.g., unregularized linear regression. The latter algorithms require that proper features be chosen before learning. Therefore, we propose a novel stability selection (SS) approach using GA-based feature selection. The proposed SS approach iteratively applies GA to a subsample of records and features. Each GA individual represents a binary vector of selected features in the subsample. An unregularized logistic linear regression model is then trained and tested on the GA-selected features through cross-validation of the subsamples. GA fitness is evaluated by the area under the curve (AUC) and optimized during a GA run. AUC is assessed with an unregularized logistic regression model on multiple subsamples of healthcare records collected under the Healthcare Cost and Utilization Project (HCUP), utilizing the National (Nationwide) Inpatient Sample (NIS) database. Reported results show that averaging feature importance from top-4 SS and the SS using GA (GASS) improves these AUC results. |
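
The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the GA operators (truncation selection, one-point crossover, bit-flip mutation), the gradient-descent fit, the fold scheme, and every default parameter are assumptions made for the example. Feature importance is taken here as the fraction of rounds in which a feature was selected by the best GA individual, given that it appeared in the subsample. The averaging with top-4 SS mentioned in the abstract is not covered.

```python
import math
import random

def auc(scores, labels):
    """Area under the ROC curve by pairwise comparison (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return 0.5  # undefined with a single class; treat as chance level
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def fit_logistic(X, y, steps=100, lr=0.5):
    """Unregularized logistic regression fitted by plain gradient descent."""
    w = [0.0] * (len(X[0]) + 1)  # last weight is the intercept
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, target in zip(X, y):
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, row))
            p = 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))
            for j, xj in enumerate(row):
                grad[j] += (p - target) * xj
            grad[-1] += p - target
        w = [wi - lr * g / len(X) for wi, g in zip(w, grad)]
    return w

def predict(w, X):
    """Linear predictor; monotone in probability, so sufficient for AUC."""
    return [w[-1] + sum(wi * xi for wi, xi in zip(w, row)) for row in X]

def fitness(mask, X, y, folds=3):
    """Mean cross-validated AUC using only the features switched on in mask."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 0.0
    Xs = [[row[j] for j in cols] for row in X]
    total = 0.0
    for f in range(folds):
        train = [i for i in range(len(Xs)) if i % folds != f]
        test = [i for i in range(len(Xs)) if i % folds == f]
        w = fit_logistic([Xs[i] for i in train], [y[i] for i in train])
        total += auc(predict(w, [Xs[i] for i in test]), [y[i] for i in test])
    return total / folds

def run_ga(X, y, rng, pop_size=8, generations=4, p_mut=0.1):
    """Evolve binary feature masks; return the fittest mask found."""
    n = len(X[0])  # needs at least 2 features for one-point crossover
    cache = {}
    def score(mask):
        key = tuple(mask)
        if key not in cache:
            cache[key] = fitness(mask, X, y)
        return cache[key]
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        parents = sorted(pop, key=score, reverse=True)[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            children.append([bit ^ (rng.random() < p_mut) for bit in child])
        pop = parents + children
    return max(pop, key=score)

def ga_stability_selection(X, y, rounds=5, row_frac=0.5, col_frac=0.5, seed=0):
    """Selection frequency per feature over repeated GA runs on subsamples."""
    rng = random.Random(seed)
    n_rows, n_cols = len(X), len(X[0])
    picked, seen = [0] * n_cols, [0] * n_cols
    for _ in range(rounds):
        rows = rng.sample(range(n_rows), int(n_rows * row_frac))
        cols = rng.sample(range(n_cols), max(2, int(n_cols * col_frac)))
        best = run_ga([[X[i][j] for j in cols] for i in rows],
                      [y[i] for i in rows], rng)
        for j, bit in zip(cols, best):
            seen[j] += 1
            picked[j] += bit
    return [p / s if s else 0.0 for p, s in zip(picked, seen)]
```

Calling `ga_stability_selection(X, y)` on a list of numeric feature rows `X` and 0/1 labels `y` returns one score in [0, 1] per feature. The pure-Python fit keeps the sketch dependency-free but would be far too slow for a database of the size of NIS, where a vectorized library would be the more realistic choice.
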
Year | DOI | Venue |
---|---|---|
2017 | 10.1145/3067695.3076077 | GECCO (Companion) |
Keywords | Field | DocType |
Stability Selection, Genetic Algorithm, Feature Selection, Feature Importance, Cross-validation, Logistic Generalized Linear Regression, Healthcare Cost Utility Project, Disease Risk Prediction, Healthcare Records | Data mining, Feature selection, Computer science, Lasso (statistics), Artificial intelligence, Random forest, Logistic regression, Genetic algorithm, Binary number, Linear regression, Pattern recognition, Cross-validation, Machine learning | Conference |
Citations | PageRank | References |
0 | 0.34 | 4 |
Authors | ||
4 |
Name | Order | Citations | PageRank |
---|---|---|---|
Ales Zamuda | 1 | 400 | 18.26 |
Christine Zarges | 2 | 313 | 22.66 |
Gregor Stiglic | 3 | 83 | 17.53 |
Goran Hrovat | 4 | 9 | 1.52 |