Title
Measuring documents similarity in large corpus using MapReduce algorithm
Abstract
Measures of similarity between documents and queries have been extensively studied in information retrieval. Measuring the similarity of documents is a crucial component of many text-analysis tasks, including information retrieval, document classification, and document clustering. However, a growing number of tasks require computing the similarity between two very short segments of text. A large corpus contains a large number of composed documents, and most of them require a similarity computation for validation. In this paper, we propose an approach for measuring similarity between documents in a large corpus. For evaluation, we compare the proposed approach with previously presented approaches by using our new MapReduce algorithm. Simulation results on the Hadoop framework show that our new MapReduce algorithm outperforms the classical ones in terms of running time and increases the value of the similarity.
Year
2016
DOI
10.1109/ICMCS.2016.7905587
Venue
2016 5th International Conference on Multimedia Computing and Systems (ICMCS)
Keywords
Hadoop cluster, document similarity, MapReduce programming model, similarity measure
DocType
Conference
ISSN
2472-7652
ISBN
978-1-5090-5147-2
Citations
1
PageRank
0.35
References
3
Authors
3
Name                    Order   Citations   PageRank
Marouane Birjali        1       14          3.57
Abderrahim Beni-Hssane  2       1           0.35
Mohammed Erritali       3       14          10.03