Title
Parallel Training GBRT Based on KMeans Histogram Approximation for Big Data.
Abstract
Gradient Boosting Regression Tree GBRT, one of the state-of-the-art ranking algorithms widely used in industry, faces challenges in the big data era. With the rapid increase in the sizes of datasets, the iterative training process of GBRT becomes very time-consuming over large scale data. In this paper, we aim to speed up the training process of each tree in the GBRT framework. First, we propose a novel KMeans histogram building algorithm which has lower time complexity and is more efficient than the cutting-edge histogram building method. Further, we put forward an approximation algorithm by combining the kernel density estimation with the histogram technique to improve the accuracy. We conduct a variety of experiments on both the public Learning To RankLTR benchmark datasets and the large-scale real-world datasets from Baidu search engine. The experimental results show that our proposed parallel training algorithm outperforms the state-of-the-art parallel GBRT algorithm with near 2 times speedup and better accuracy. Also, our algorithm achieves the near-linear scalability.
Year
DOI
Venue
2015
10.1007/978-3-319-27122-4_4
ICA3PP
Field
DocType
Citations 
Histogram,Data mining,Approximation algorithm,k-means clustering,Learning to rank,Pattern recognition,Computer science,Artificial intelligence,Time complexity,Gradient boosting,Speedup,Kernel density estimation
Conference
0
PageRank 
References 
Authors
0.34
11
8
Name
Order
Citations
PageRank
Rong Gu111017.77
Lei Jin2100.96
Yongwei Wu366965.71
Jingying Qu400.34
Tao Wang510535.19
Xiaojun Wang6536.68
Chunfeng Yuan7172.90
Huang, Yihua816722.07