Title
Discovering all most specific sentences
Abstract
Data mining can be viewed, in many instances, as the task of computing a representation of a theory of a model or a database, in particular by finding a set of maximally specific sentences satisfying some property. We prove some hardness results that rule out simple approaches to solving the problem. The a priori algorithm is an algorithm that has been successfully applied to many instances of the problem. We analyze this algorithm, and prove that it is optimal when the maximally specific sentences are "small". We also point out its limitations. We then present a new algorithm, the Dualize and Advance algorithm, and prove worst-case complexity bounds that are favorable in the general case. Our results use the concept of hypergraph transversals. Our analysis shows that the a priori algorithm can solve the problem of enumerating the transversals of a hypergraph, improving on previously known results in a special case. On the other hand, using results for the general case of the hypergraph transversal enumeration problem, we can show that the Dualize and Advance algorithm has worst-case running time that is sub-exponential in the output size (i.e., the number of maximally specific sentences). We further show that the problem of finding maximally specific sentences is closely related to the problem of exact learning with membership queries studied in computational learning theory.
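The levelwise idea behind the a priori algorithm can be illustrated on one instance of the problem named in the keywords, maximal frequent sets. The sketch below is not taken from the paper: it assumes that sentences are itemsets and that a sentence is "interesting" when at least `min_support` transactions contain it. The function name `maximal_frequent_sets` and its parameters are hypothetical.

```python
from itertools import combinations


def maximal_frequent_sets(transactions, min_support):
    """Levelwise (Apriori-style) search for the maximal frequent itemsets.

    Illustrative sketch only: names and the support-count predicate are
    assumptions, not the paper's notation.
    """
    transactions = [frozenset(t) for t in transactions]

    def is_frequent(itemset):
        # An itemset is frequent if enough transactions contain it.
        return sum(itemset <= t for t in transactions) >= min_support

    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items if is_frequent(frozenset([i]))]
    frequent = set(level)

    while level:
        current = set(level)
        # Join step: combine two size-k sets that differ in exactly one item.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune step: every size-k subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in current
                             for s in combinations(c, len(c) - 1))}
        # Only the surviving candidates are checked against the data.
        level = [c for c in candidates if is_frequent(c)]
        frequent.update(level)

    # The maximal sets are the frequent sets with no frequent proper superset.
    return {s for s in frequent if not any(s < t for t in frequent)}
```

The search visits every frequent set on its way up to the maximal ones, so its cost grows with the size of the largest maximal sets. This is consistent with the abstract's remark that the approach works well when the maximally specific sentences are "small".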
Year: 2003
DOI: 10.1145/777943.777945
Venue: ACM Trans. Database Syst.
Keywords: hypergraph transversal enumeration problem, data mining, special case, minimal keys, maximal frequent sets, exact learning, new algorithm, association rules, worst-case complexity bound, learning with membership queries, general case, hypergraph transversals, maximally specific sentence, Advance algorithm
Field: Data mining, Computer science, Association rule learning, Artificial intelligence, Natural language processing
DocType: Journal
Volume: 28
Issue: 2
ISSN: 0362-5915
Citations: 120
PageRank: 4.14
References: 31
Authors: 6
Name, Order, Citations, PageRank
Dimitrios Gunopulos, 1, 7171715.85
Roni Khardon, 2, 1068133.16
Heikki Mannila, 3, 65951495.69
Sanjeev Saluja, 4, 25979.80
Hannu Toivonen, 5, 4261776.95
Ram Sewak Sharma, 6, 120, 4.14