Title
Distributed Decision Tree V.2.0
Abstract
Decision Tree is a state-of-the-art classification and prediction algorithm in machine learning that constructs a tree-structured set of attributes. Its distributed implementation, Distributed Decision Tree, generates a specified number of trees (depending on the number of partitions of the input dataset) and finally collects votes for classification or averages the predictions. The degree of parallelism therefore depends on the number of partitions and can be tuned by choosing that number appropriately. However, this setup compromises accuracy, because there is always a tradeoff between accuracy and partition size. In this paper, we therefore propose an improved Distributed Decision Tree algorithm that achieves true parallelism without loss of accuracy. The improved Distributed Decision Tree is implemented using the open-source distributed frameworks Hadoop and Spark. We measure learning time, tree size, and accuracy to establish benchmarks on medium to large datasets.
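The baseline scheme the abstract describes (one tree per partition, then a vote) can be illustrated with a minimal sketch. This is not the improved algorithm the paper proposes, whose details the abstract does not give. It also makes several assumptions: it runs in plain Python on one machine rather than on Hadoop or Spark, it uses round-robin partitioning and a small Gini-based tree as stand-ins, and all function names (`build_tree`, `train_distributed`, `predict_vote`) are hypothetical.

```python
from collections import Counter


def majority(labels):
    """Return the most common label (ties go to the first one seen)."""
    return Counter(labels).most_common(1)[0][0]


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def build_tree(rows, labels, depth=0, max_depth=3):
    """Grow a small binary tree on numeric features (illustrative stand-in)."""
    if depth == max_depth or len(set(labels)) == 1:
        return {"label": majority(labels)}  # leaf node
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [i for i, r in enumerate(rows) if r[f] <= t]
            right = [i for i, r in enumerate(rows) if r[f] > t]
            if not left or not right:
                continue  # a split must send rows to both sides
            score = (len(left) * gini([labels[i] for i in left])
                     + len(right) * gini([labels[i] for i in right])) / len(rows)
            if best is None or score < best[0]:
                best = (score, f, t, left, right)
    if best is None:
        return {"label": majority(labels)}  # no usable split found
    _, f, t, left, right = best
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([rows[i] for i in left],
                           [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right],
                            [labels[i] for i in right], depth + 1, max_depth),
    }


def predict_tree(tree, row):
    """Walk from the root to a leaf and return that leaf's label."""
    while "label" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["label"]


def train_distributed(rows, labels, num_partitions):
    """Train one tree per partition (assumed round-robin partitioning)."""
    trees = []
    for p in range(num_partitions):
        part_rows = rows[p::num_partitions]
        part_labels = labels[p::num_partitions]
        if part_rows:  # skip partitions that received no rows
            trees.append(build_tree(part_rows, part_labels))
    return trees


def predict_vote(trees, row):
    """Classify a row by majority vote over the per-partition trees."""
    return majority([predict_tree(t, row) for t in trees])


if __name__ == "__main__":
    rows = [[float(x)] for x in range(12)]
    labels = ["a" if x < 6 else "b" for x in range(12)]
    trees = train_distributed(rows, labels, num_partitions=3)
    print(len(trees), predict_vote(trees, [2.0]), predict_vote(trees, [8.0]))
```

Each tree in this sketch sees only `1 / num_partitions` of the data, which is the source of the tradeoff between accuracy and partition size that the abstract mentions: more partitions give more parallel work but smaller training sets per tree.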
Year
2017
DOI
10.1109/BigData.2017.8258011
Venue
2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA)
Keywords
distributed decision tree, decision tree, spark, hadoop
DocType
Conference
ISSN
2639-1589
Citations
0
PageRank
0.34
References
0
Authors
2
Name              Order  Citations  PageRank
Ankit Desai       1      0          0.68
Sanjay Chaudhary  2      223        24.16