Title
Category-Based Infidelity Bounded Queries over Unstructured Data Streams
Abstract
We present the Caicos system that supports continuous infidelity bounded queries over a data stream, where each data item (of the stream) belongs to multiple categories. Caicos is made up of four primitives: Keywords, Queries, Data items, and Categories. A Category is a virtual entity consisting of all those data items that belong to it. The membership of a data item to a category is decided by evaluating a Boolean predicate (associated with each category) over the data item. Each data item and query in turn are associated with multiple keywords. Given a keyword query, unlike conventional unstructured data querying techniques that return the top-$(K)$ documents, Caicos returns the top-$(K)$ categories with infidelity less than the user specified infidelity bound. Caicos is designed to continuously track the evolving information present in a highly dynamic data stream. It, hence, computes the relevance of a category to the continuous keyword query using the data items occurring in the stream in the recent past (i.e., within the current "window"). To efficiently provide up-to-date answers to the continuous queries, Caicos needs to maintain the required metadata accurately. This requires addressing two subproblems: 1) Identifying the "right" metadata that needs to be updated for providing accurate results and 2) updating the metadata in an efficient manner. We show that the problem of identifying the right metadata can be further broken down into two subparts. We model the first subpart as an inequality constrained minimization problem and propose an innovative iterative algorithm for the same. The second subpart requires us to build an efficient dynamic programming-based algorithm, which helps us to find the right metadata that needs to be updated. Updating the metadata on multiple processors is a scheduling problem whose complexity is exponential in the length of the input. An approximate multiprocessor scheduling algorithm is, hence, proposed. Experimental evaluation of Caicos using real-world dynamic data shows that Caicos is able to provide fidelity close to 100 percent using 45 percent less resources than the techniques proposed in the literature. This ability of Caicos to work accurately and efficiently even in scenarios with high data arrival rates makes it suitable for data intensive application domains.
Year
DOI
Venue
2013
10.1109/TKDE.2012.200
Knowledge and Data Engineering, IEEE Transactions
Keywords
Field
DocType
data handling,dynamic programming,iterative methods,meta data,query processing,Boolean predicate,Caicos system,approximate multiprocessor scheduling algorithm,categories,category based infidelity bounded queries,data items,dynamic data stream,dynamic programming based algorithm,inequality constrained minimization problem,iterative algorithm,keyword query,keywords,metadata,queries,unstructured data querying techniques,unstructured data streams,virtual entity,Continuous query,category-based query,data stream,threshold queries
Metadata,Data mining,Metadata repository,Multiprocessor scheduling,Job shop scheduling,Computer science,Data stream,Unstructured data,Dynamic data,Artificial intelligence,Group method of data handling,Machine learning
Journal
Volume
Issue
ISSN
25
11
1041-4347
Citations 
PageRank 
References 
2
0.42
5
Authors
2
Name
Order
Citations
PageRank
Manish Bhide125319.07
Krithi Ramamritham24975936.38