Title
Toward Parallel Document Clustering
Abstract
A key challenge in the automated clustering of documents in large text corpora is the high cost of comparing documents in a multi-million-dimensional document space. The Anchors Hierarchy is a fast data structure and algorithm for localizing data based on a distance metric that obeys the triangle inequality. The algorithm strives to minimize the number of distance calculations needed to cluster the documents into "anchors" around reference documents called "pivots". We extend the original algorithm to increase the amount of available parallelism and consider two implementations: a complex data structure that affords efficient searching, and a simple data structure that requires repeated sorting. The sorting implementation is integrated with a text corpus "Bag of Words" program, and initial performance results for the end-to-end document processing workflow are reported.
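The way a triangle inequality can save distance calculations when forming anchors around pivots can be illustrated with a minimal sequential sketch. It is not the authors' parallel implementation: the function and field names (`build_anchors`, `CountingMetric`, `"pivot"`, `"owned"`) are hypothetical, plain Euclidean distance on small tuples stands in for a document distance, and each anchor simply keeps its points sorted by decreasing distance to its pivot. When a new pivot is introduced, a point whose distance to its current pivot is less than half the distance between the two pivots cannot be closer to the new pivot, so the scan of that anchor stops there.

```python
import math


def euclidean(a, b):
    """Euclidean distance; stands in for any metric obeying the triangle inequality."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class CountingMetric:
    """Wraps a metric and counts how many times it is evaluated."""

    def __init__(self, fn):
        self.fn = fn
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        return self.fn(a, b)


def build_anchors(points, k, dist):
    """Group points into at most k anchors (hypothetical helper, sequential sketch).

    Each anchor is a dict with a "pivot" index and an "owned" list of
    (distance_to_pivot, point_index) pairs sorted by decreasing distance.
    """
    first = 0
    owned = sorted(
        ((dist(points[i], points[first]), i) for i in range(len(points)) if i != first),
        reverse=True,
    )
    anchors = [{"pivot": first, "owned": owned}]
    while len(anchors) < k:
        # The farthest point of the widest anchor becomes the next pivot.
        widest = max(anchors, key=lambda a: a["owned"][0][0] if a["owned"] else -1.0)
        if not widest["owned"]:
            break
        new_pivot = widest["owned"][0][1]
        stolen = []
        for anchor in anchors:
            gap = dist(points[new_pivot], points[anchor["pivot"]])
            kept = []
            for pos, (d_old, i) in enumerate(anchor["owned"]):
                if d_old < gap / 2:
                    # Triangle inequality: this point and every later (closer)
                    # point stay put, so no further distances are computed.
                    kept.extend(anchor["owned"][pos:])
                    break
                if i == new_pivot:
                    continue  # the new pivot leaves its old anchor
                d_new = dist(points[i], points[new_pivot])
                if d_new < d_old:
                    stolen.append((d_new, i))
                else:
                    kept.append((d_old, i))
            anchor["owned"] = kept
        stolen.sort(reverse=True)
        anchors.append({"pivot": new_pivot, "owned": stolen})
    return anchors


if __name__ == "__main__":
    pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    metric = CountingMetric(euclidean)
    for a in build_anchors(pts, 2, metric):
        print(a["pivot"], sorted(i for _, i in a["owned"]))
    print("distance calculations:", metric.calls)
```

The repeated `sort` calls in this sketch loosely echo the "simple data structure which requires repeated sorting" mentioned in the abstract, but the resemblance is only illustrative.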
Year: 2011
DOI: 10.1109/IPDPS.2011.327
Venue: IPDPS Workshops
Keywords: end-to-end document processing workflow, localizing data, distance metric, parallel document clustering, large text corpus, simple data structure, distance calculation, complex data structure, fast data structure, multi-million dimensional document space, original algorithm, data structure, complex data, bag of words, triangle inequality, semantics, processing, sorting, clustering algorithms, concurrent computing, document clustering, parallel algorithms, synchronization, implementation, document processing, indexes, parallel processing, performance, algorithms, data structures
Field: Bag-of-words model, Data structure, Parallel algorithm, Computer science, Document clustering, Document processing, Metric (mathematics), Sorting, Theoretical computer science, Cluster analysis
DocType: Conference
Citations: 1
PageRank: 0.36
References: 5
Authors: 2
Name, Order, Citations, PageRank
Jace A. Mogill, 1, 1, 0.36
David J. Haglin, 2, 112, 19.45