Title
Optimal Distance Bounds on Time-Series Data
Abstract
Most data mining operations include an integral search com- ponent at their core. For example, the performance of sim- ilarity search or classification based on Nearest Neighbors is largely dependent on the underlying compression and distance estimation techniques. As data repositories grow larger, there is an explicit need not only for storing the data in a compressed form, but also for facilitating mining opera- tions directly on the compressed data. Naturally, the quality or tightness of the estimated distances on the compressed ob- jects directly affects the search performance. We motivate our work within the setting of search en- gine weblog repositories, where keyword demand trends over time are represented and stored as compressed time- series data. Search and analysis over such sequence data has important applications for the search engines, including dis- covery of important news events, keyword recommendation and efficient keyword-to-advertisement mapping. We present new mechanisms for very fast search opera- tions over the compressed time-series data, with specific fo- cus on weblog data. An important contribution of this work is the derivation of optimally tight bounds on the Euclidean distance estimation between compressed sequences. Since our methodology is applicable to sequential data in general, the proposed technique is of independent interest. Addition- ally, our distance estimation strategy is not tied to a specific compression methodology, but can be applied on top of any orthonormal based compression technique (Fourier, Wavelet, PCA, etc). The experimental results indicate that the new op- timal bounds lead to a significant improvement in the prun- ing power of search compared to previous state-of-the-art, in many cases eliminating more than 80% of the candidate search sequences.
Year
Venue
Keywords
2009
SDM
search engine,nearest neighbor,euclidean distance,time series data,data mining
Field
DocType
Citations 
R-tree,Minkowski distance,Search engine,Pattern recognition,Computer science,Euclidean distance,Jaro–Winkler distance,Mahalanobis distance,Distance matrix,Artificial intelligence,Machine learning,Nearest neighbor search
Conference
7
PageRank 
References 
Authors
0.57
22
3
Name
Order
Citations
PageRank
Michail Vlachos11899113.39
Suleyman S. Kozat221215.94
Philip S. Yu3306703474.16