Title
YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark
Abstract
Frequent itemset mining (FIM) is one of the most important techniques for extracting knowledge from data in many real-world applications. The Apriori algorithm is a widely used algorithm for mining frequent itemsets from a transactional dataset. However, the FIM process is both data-intensive and computing-intensive. On the one hand, large-scale datasets are now commonly used in data mining; on the other hand, to generate valid information, the algorithm needs to scan the dataset iteratively many times. Together, these make FIM algorithms very time-consuming over big data. Parallel and distributed computing is an effective and widely used strategy for speeding up algorithms on large-scale datasets. However, the existing parallel Apriori algorithms implemented with the MapReduce model are not efficient enough for iterative computation. In this paper, we propose YAFIM (Yet Another Frequent Itemset Mining), a parallel Apriori algorithm based on the Spark RDD framework, an in-memory parallel computing model specially designed to support iterative algorithms and interactive data mining. Experimental results show that, compared with the algorithms implemented with MapReduce, YAFIM achieves an 18× speedup on average across various benchmarks. In particular, we apply YAFIM in a real-world medical application to explore the relationships in medicine, where it outperforms the MapReduce method by around 25 times.
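To make concrete why the abstract calls the process iterative, here is a minimal single-machine sketch of level-wise Apriori in plain Python. It is not the YAFIM implementation and does not use Spark. The function name `apriori`, the use of an absolute count for `min_support`, and the toy transactions are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset seen in at least
    `min_support` transactions (absolute count, an assumption of this sketch)."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items with one scan of the data.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets
        # and keep only those whose (k-1)-subsets are all frequent.
        candidates = set()
        for a in prev:
            for b in prev:
                union = a | b
                if len(union) == k and all(
                    frozenset(sub) in prev for sub in combinations(union, k - 1)
                ):
                    candidates.add(union)

        # Pass k: another full scan of the transactions to count candidates.
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1

    return result


# Hypothetical toy data, not taken from the paper's benchmarks.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
itemsets = apriori(toy, min_support=3)
```

Each pass of the `while` loop rescans every transaction. This repeated scan is the cost the abstract says an in-memory model is meant to reduce.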
Year
2014
DOI
10.1109/IPDPSW.2014.185
Venue
IPDPS Workshops
Keywords
data-intensive, spark rdd framework, fim process, mapreduce, medical application, in-memory parallel computing model, large scale data sets, frequent itemset mining, transactional dataset, parallel apriori algorithm, interactive data mining, computing-intensive, parallel frequent itemset mining algorithm, large scale dataset algorithms, yafim, real-world applications, apriori algorithm, parallel algorithms, spark, parallel computing, iterative algorithms, distributed computing, data mining, knowledge extraction, yet another frequent itemset mining, iterative methods, clustering algorithms, algorithm design and analysis, classification algorithms, computational modeling
Field
Data mining, Data set, Spark (mathematics), Computer science, GSP Algorithm, Apriori algorithm, FSA-Red Algorithm, Big data, Computation, Speedup
DocType
Conference
Citations
25
PageRank
0.98
References
14
Authors
4
Name | Order | Citations | PageRank
Hongjian Qiu | 1 | 25 | 0.98
Rong Gu | 2 | 110 | 17.77
Chunfeng Yuan | 3 | 418 | 30.84
Yihua Huang | 4 | 167 | 22.07