YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark - Citegraph

Paper Info

Title
YAFIM: A Parallel Frequent Itemset Mining Algorithm with Spark

Abstract
Frequent itemset mining (FIM) is one of the most important techniques for extracting knowledge from data in many real-world applications. The Apriori algorithm is a widely used algorithm for mining frequent itemsets from a transactional dataset. However, the FIM process is both data-intensive and computing-intensive. On one hand, large-scale datasets are now commonly used in data mining; on the other hand, the algorithm needs to scan the dataset iteratively many times in order to generate valid information. Together, these make FIM algorithms very time-consuming over big data. Parallel and distributed computing is an effective and widely used strategy for speeding up algorithms on large-scale datasets. However, the existing parallel Apriori algorithms implemented with the MapReduce model are not efficient enough for iterative computation. In this paper, we propose YAFIM (Yet Another Frequent Itemset Mining), a parallel Apriori algorithm based on the Spark RDD framework, a specially designed in-memory parallel computing model that supports iterative algorithms and interactive data mining. Experimental results show that, compared with the algorithms implemented with MapReduce, YAFIM achieves an 18× speedup on average across various benchmarks. In particular, we apply YAFIM in a real-world medical application to explore the relationships in medicine, where it outperforms the MapReduce method by around 25 times.
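
The iterative scanning that the abstract refers to can be illustrated with a minimal single-machine sketch of level-wise Apriori. This is not the YAFIM implementation and does not use Spark; the function name `apriori`, its parameters, and the absolute-count `min_support` threshold are assumptions made for illustration only. Comments mark the per-pass counting step that a parallel version would presumably distribute across workers.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): count} for every itemset that occurs
    in at least `min_support` transactions (absolute count, an assumption)."""
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items. In a distributed setting this is the kind
    # of step that maps over transactions and sums counts per key.
    counts = Counter(frozenset([item]) for t in transactions for item in t)
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = set(frequent)

        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}

        # Apriori pruning: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }

        # Pass k: rescan all transactions and count the candidates each contains.
        # This repeated full scan is what makes the process iteration-heavy.
        counts = Counter(c for t in transactions for c in candidates if c <= t)
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1

    return result
```

Calling `apriori(transactions, min_support)` with a list of item lists returns a dictionary keyed by `frozenset`. Each loop iteration rescans the full transaction list, which is why a framework that keeps the dataset in memory between passes suits this kind of algorithm.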

Year	DOI	Venue
2014	10.1109/IPDPSW.2014.185	IPDPS Workshops
Keywords	Field	DocType
data-intensive,spark rdd framework,fim process,mapreduce,medical application,in-memory parallel computing model,large scale data sets,frequent itemset mining,transactional dataset,parallel apriori algorithm,interactive data mining,computing-intensive,parallel frequent itemset mining algorithm,large scale dataset algorithms,yafim,real-world applications,apriori algorithm,parallel algorithms,spark,parallel computing,iterative algorithms,distributed computing,data mining,knowledge extraction,yet another frequent itemset mining,iterative methods,clustering algorithms,algorithm design and analysis,classification algorithms,computational modeling	Data mining,Data set,Spark (mathematics),Computer science,GSP Algorithm,Apriori algorithm,FSA-Red Algorithm,Big data,Computation,Speedup	Conference
Citations	PageRank	References
25	0.98	14
Authors
4

Authors (4 rows)

Name	Order	Citations	PageRank
Hongjian Qiu	1	25	0.98
Rong Gu	2	110	17.77
Chunfeng Yuan	3	418	30.84
Yihua Huang	4	167	22.07
