Abstract |
---|
We propose a new algorithm for building decision tree classifiers. The algorithm executes in a distributed environment and is especially designed for classifying large data sets and streaming data. It is empirically shown to be as accurate as a standard decision tree classifier, while being scalable for processing of streaming data on multiple processors. These findings are supported by a rigorous analysis of the algorithm's accuracy. The essence of the algorithm is to quickly construct histograms at the processors, which compress the data to a fixed amount of memory. A master processor uses this information to find near-optimal split points for terminal tree nodes. Our analysis shows that guarantees on the local accuracy of split points imply guarantees on the accuracy of the overall tree. |
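The fixed-memory compression mentioned in the abstract can be illustrated with a small on-line histogram: each incoming point becomes its own bin, and when the bin budget is exceeded, the two closest bins are merged into a weighted centroid. This is a minimal sketch of that idea, not the paper's exact implementation; the class name, bin budget, and representation are illustrative assumptions.

```python
import bisect

class StreamingHistogram:
    """Sketch of a fixed-size streaming histogram: at most `max_bins`
    (centroid, count) pairs, regardless of how many points arrive."""

    def __init__(self, max_bins=8):
        self.max_bins = max_bins
        self.bins = []  # sorted list of [centroid, count]

    def update(self, p):
        # Insert the point as a singleton bin, keeping bins sorted.
        bisect.insort(self.bins, [float(p), 1])
        # If over budget, merge the two closest bins into their
        # count-weighted centroid, so memory stays fixed.
        if len(self.bins) > self.max_bins:
            gaps = [self.bins[i + 1][0] - self.bins[i][0]
                    for i in range(len(self.bins) - 1)]
            i = gaps.index(min(gaps))
            (p1, m1), (p2, m2) = self.bins[i], self.bins[i + 1]
            merged = [(p1 * m1 + p2 * m2) / (m1 + m2), m1 + m2]
            self.bins[i:i + 2] = [merged]
```

Because a merge preserves both the total count and the count-weighted sum of centroids, histograms built on separate processors summarize their data streams compactly enough for a master to aggregate them and search for split points.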
Year | DOI | Venue |
---|---|---|
2010 | 10.1145/1756006.1756034 | Journal of Machine Learning Research |

Keywords | Field | DocType
---|---|---
scalability,standard decision tree classifier,local accuracy,streaming parallel decision tree,split point,near-optimal split point,new algorithm,decision tree classifier,overall tree accuracy,rigorous analysis,decision tree classifiers,large data set,streaming data,distributed computing,terminal tree node,decision tree,distributed environment | Data mining,Tree traversal,Computer science,C4.5 algorithm,Artificial intelligence,Segment tree,ID3 algorithm,Fractal tree index,Machine learning,Interval tree,Incremental decision tree,Search tree | Journal

Volume | ISSN | Citations
---|---|---
11 | 1532-4435 | 54

PageRank | References | Authors
---|---|---
1.77 | 20 | 2
Name | Order | Citations | PageRank |
---|---|---|---
Y. Ben-Haim | 1 | 140 | 8.29 |
Elad Yom-Tov | 2 | 54 | 1.77