Title
Input selection for fast feature engineering
Abstract
The application of machine learning to large datasets has become a vital component of many important and sophisticated software systems built today. Such trained systems are often based on supervised learning tasks that require features, signals extracted from the data that distill complicated raw data objects into a small number of salient values. A trained system's success depends substantially on the quality of its features. Unfortunately, feature engineering (the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm) is a tedious, time-consuming experience. Because “big data” inputs are so diverse, feature engineering is often a trial-and-error process requiring many small, iterative code changes. Because the inputs are so large, each code change can involve a time-consuming data processing task (over each page in a Web crawl, for example). We introduce Zombie, a data-centric system that accelerates feature engineering through intelligent input selection, optimizing the “inner loop” of the feature engineering process. Our system yields feature evaluation speedups of up to 8× in some cases and reduces engineer wait times from 8 to 5 hours in others.
Year
2016
DOI
10.1109/ICDE.2016.7498272
Venue
2016 IEEE 32nd International Conference on Data Engineering (ICDE)
Keywords
input selection, fast feature engineering, machine learning, supervised learning tasks, Big Data, data processing task, Zombie data-centric system
Field
Data mining, Semi-supervised learning, Computer science, Feature model, Feature engineering, Artificial intelligence, Feature vector, Feature (computer vision), Feature extraction, Supervised learning, Machine learning, Database, Feature learning
DocType
Conference
ISSN
1084-4627
Citations
7
PageRank
0.45
References
31
Authors
2
Name | Order | Citations | PageRank
Michael Anderson | 1 | 125 | 19.21
Michael J. Cafarella | 2 | 2246 | 144.15