Title
Mind the gap: large-scale frequent sequence mining
Abstract
Frequent sequence mining is one of the fundamental building blocks in data mining. While the problem has been extensively studied, few of the available techniques are sufficiently scalable to handle datasets with billions of sequences; such large-scale datasets arise, for instance, in text mining and session analysis. In this paper, we propose MG-FSM, a scalable algorithm for frequent sequence mining on MapReduce. MG-FSM can handle so-called \"gap constraints\", which can be used to limit the output to a controlled set of frequent sequences. At its heart, MG-FSM partitions the input database in a way that allows us to mine each partition independently using any existing frequent sequence mining algorithm. We introduce the notion of w-equivalency, which is a generalization of the notion of a \"projected database\" used by many frequent pattern mining algorithms. We also present a number of optimization techniques that minimize partition size, and therefore computational and communication costs, while still maintaining correctness. Our experimental study in the context of text mining suggests that MG-FSM is significantly more efficient and scalable than alternative approaches.
Year
DOI
Venue
2013
10.1145/2463676.2465285
SIGMOD Conference
Keywords
Field
DocType
partition size,scalable algorithm,large-scale frequent sequence mining,frequent sequence mining,input database,frequent pattern mining algorithm,data mining,frequent sequence,large-scale datasets,existing frequent sequence mining,text mining
Data mining,Concept mining,Data stream mining,Text mining,GSP Algorithm,Computer science,Correctness,Molecule mining,Partition (number theory),Database,Scalability
Conference
Citations 
PageRank 
References 
21
0.70
27
Authors
4
Name
Order
Citations
PageRank
Iris Miliaraki123710.40
Klaus Berberich2127168.96
Rainer Gemulla3108153.54
Spyros Zoupanos4484.42