Title
Efficient Parallelization of the Google Trigram Method for Document Relatedness Computation.
Abstract
Computing pairwise document relatedness plays an important role in a variety of Natural Language Processing problems. The Google Trigram Method (GTM) is one of the corpus-based unsupervised methods that can be used to capture word relatedness and document relatedness. It has been shown that GTM can be applied to construct high-quality document relatedness applications. However, implementing GTM for pairwise document relatedness computation on a large document set is challenging because of its high computational complexity. This paper presents time- and space-efficient methods for computing pairwise document relatedness using GTM. To improve performance, algorithmic engineering, data structure enhancement, and parallel computing methods are applied. Two parallel methods are discussed: a shared-memory multicore implementation and a distributed-memory Hadoop implementation. Both parallel methods provide an order-of-magnitude improvement in the speed of pairwise document relatedness computation using GTM.
Year
2015
DOI
10.1109/ICPPW.2015.42
Venue
ICPP Workshops
Keywords
distributed memory Hadoop implementation, shared memory multicore implementation, corpus-based unsupervised method, GTM, Google Trigram method, natural language processing, pair wise document relatedness, document relatedness computation, parallelization
Field
Data structure, Shared memory, Computer science, Trigram, Parallel processing, Parallel computing, Distributed memory, Multi-core processor, Distributed computing, Computational complexity theory, Computation
DocType
Conference
ISSN
1530-2016
Citations
0
PageRank
0.34
References
10
Authors
7
Name                  Order  Citations  PageRank
Xinxin Kou            1      0          0.34
Jie Mei               2      1          3.06
Zhimin Yao            3      1          1.09
Andrew Rau-Chaplin    4      638        61.65
Aminul Islam          5      328        31.16
Abidalrahman Moh'd    6      38         8.92
Evangelos E. Milios   7      290        41.22