Title
HistoSketch: Fast Similarity-Preserving Sketching of Streaming Histograms with Concept Drift
Abstract
Histogram-based similarity has been widely adopted in many machine learning tasks. However, measuring histogram similarity is challenging for streaming data, where the elements of a histogram are observed in a streaming manner. First, the ever-growing cardinality of histogram elements makes any similarity computation inefficient. Second, concept drift in the data streams impairs accurate assessment of the similarity. In this paper, we propose to overcome these challenges with HistoSketch, a fast similarity-preserving sketching method for streaming histograms with concept drift. Specifically, HistoSketch incrementally maintains a set of compact, fixed-size sketches of streaming histograms to approximate the similarity between the histograms, while gradually forgetting outdated histogram elements. We evaluate HistoSketch on multiple classification tasks using both synthetic and real-world datasets. The results show that our method efficiently approximates similarity for streaming histograms and quickly adapts to concept drift. Compared to full streaming histograms that gradually forget outdated histogram elements, HistoSketch dramatically reduces the classification time (a 7500x speedup) with only a modest loss in accuracy (about 3.5%).
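The idea of a fixed-length sketch that is updated incrementally and forgets old elements can be illustrated with a small Python sketch. This is not the algorithm from the paper: it swaps in a simpler exponential-race weighted sampler in place of consistent weighted sampling, and it keeps the full decayed histogram in a dictionary for clarity, which a real fixed-memory method would have to avoid. The names `DecayedSketch`, `unit_hash`, and `similarity`, and the parameters `size` and `decay`, are hypothetical.

```python
import hashlib
import math


def unit_hash(element, slot):
    """Deterministic pseudo-random value in (0, 1] for an (element, slot) pair."""
    digest = hashlib.blake2b(f"{slot}:{element}".encode(), digest_size=8).digest()
    return (int.from_bytes(digest, "big") + 1) / (2**64 + 1)


class DecayedSketch:
    """Illustrative sketch: one sampled element per slot, with exponential decay.

    Assumption: each arriving element has weight 1, and all earlier weights are
    multiplied by `decay` (0 < decay <= 1) on every arrival.
    """

    def __init__(self, size, decay):
        self.size = size
        self.decay = decay
        self.weights = {}                # decayed histogram (kept only for clarity)
        self.keys = [math.inf] * size    # smallest race key seen per slot
        self.items = [None] * size       # element holding that smallest key

    def add(self, element):
        # Decaying every weight by `decay` divides every race key by `decay`,
        # so the per-slot winner is unchanged and only the key is rescaled.
        for known in self.weights:
            self.weights[known] *= self.decay
        self.keys = [key / self.decay for key in self.keys]

        weight = self.weights.get(element, 0.0) + 1.0
        self.weights[element] = weight

        # Only the arriving element gained weight, so only it can take a slot.
        for slot in range(self.size):
            key = -math.log(unit_hash(element, slot)) / weight
            if key < self.keys[slot]:
                self.keys[slot] = key
                self.items[slot] = element


def similarity(a, b):
    """Fraction of slots on which two equally sized sketches agree."""
    matches = sum(x == y for x, y in zip(a.items, b.items))
    return matches / a.size


drifting = DecayedSketch(size=64, decay=0.8)
for element in ["a"] * 50 + ["b"] * 50:   # the stream drifts from "a" to "b"
    drifting.add(element)

recent = DecayedSketch(size=64, decay=0.8)
for element in ["b"] * 50:
    recent.add(element)

print(similarity(drifting, recent))
```

Because both sketches share the same hash values, the fraction of agreeing slots estimates a similarity between the two decayed histograms; with `decay=0.8` the early "a" elements carry almost no weight, so the drifting stream ends up close to the stream that only ever saw "b".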
Year: 2017
DOI: 10.1109/ICDM.2017.64
Venue: 2017 IEEE International Conference on Data Mining (ICDM)
Keywords: Similarity-Preserving Sketching, Histograms, Streaming Data, Concept Drift, Consistent Weighted Sampling
Field: Forgetting, Data mining, Histogram, Data stream mining, Similarity computation, Computer science, Cardinality, Concept drift, Streaming data, Speedup
DocType: Conference
ISSN: 1550-4786
ISBN: 978-1-5386-2449-4
Citations: 3
PageRank: 0.39
References: 24
Authors: 4
Name, Order, Citations, PageRank
Dingqi Yang, 1, 542, 28.79
Bin Li, 2, 694, 50.02
Laura Rettig, 3, 4, 0.74
o de troyer, 4, 1708, 134.92