Abstract |
---|
Decision Tree is a state-of-the-art classification and prediction algorithm in machine learning that constructs a tree-structured set of attributes. Its distributed implementation, the Distributed Decision Tree, generates a specified number of trees (depending on the number of partitions of the input dataset) and, at the end, collects votes or averages the predictions or classifications. Here, the overall degree of parallelism depends on the number of partitions, so parallelism can be achieved by properly tuning the partition count. However, this setup in turn compromises accuracy, because there is always a trade-off between accuracy and partition size. Therefore, in this paper, we propose an improved Distributed Decision Tree algorithm that achieves true parallelism without loss of accuracy. The improved Distributed Decision Tree is implemented using the open-source distributed frameworks Hadoop and Spark. We measure learning time, tree size, and accuracy to establish benchmarks using medium to large datasets. |
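The abstract's scheme of training one tree per data partition and then collecting votes can be illustrated with a minimal sketch. This is a hypothetical toy, not the paper's implementation: `train_stub` stands in for actual decision-tree induction (it just predicts the partition's majority label), and the round-robin `partition` helper is an assumption about how data might be split.

```python
# Hypothetical sketch of per-partition training plus majority-vote
# aggregation, as described in the abstract (not the paper's code).
from collections import Counter

def partition(dataset, n_parts):
    """Split the dataset round-robin into n_parts partitions."""
    parts = [[] for _ in range(n_parts)]
    for i, row in enumerate(dataset):
        parts[i % n_parts].append(row)
    return parts

def train_stub(part):
    """Stand-in for tree induction: always predict the partition's
    majority label, ignoring the features (illustration only)."""
    majority = Counter(label for _, label in part).most_common(1)[0][0]
    return lambda features: majority

def ensemble_predict(models, features):
    """Collect one vote per per-partition model; return the majority."""
    votes = Counter(m(features) for m in models)
    return votes.most_common(1)[0][0]

# Toy dataset: (features, label) pairs.
data = [((0,), "a"), ((1,), "a"), ((2,), "b"),
        ((3,), "a"), ((4,), "a"), ((5,), "b"),
        ((6,), "b"), ((7,), "b"), ((8,), "a")]
models = [train_stub(p) for p in partition(data, 3)]
print(ensemble_predict(models, (9,)))  # prints "a" (votes: a, a, b)
```

The tension the paper targets is visible even here: more partitions means more parallel workers, but each `train_stub` sees less data, which in a real tree learner degrades per-tree accuracy.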
Year | DOI | Venue
---|---|---|
2017 | 10.1109/BigData.2017.8258011 | 2017 IEEE International Conference on Big Data (Big Data) |
Keywords | DocType | ISSN
---|---|---|
distributed decision tree, decision tree, spark, hadoop | Conference | 2639-1589 |
Citations | PageRank | References
---|---|---|
0 | 0.34 | 0 |
Authors |
---|
2 |
Name | Order | Citations | PageRank |
---|---|---|---|
Ankit Desai | 1 | 0 | 0.68 |
Sanjay Chaudhary | 2 | 223 | 24.16 |