Title
On scaling latent semantic indexing for large peer-to-peer systems
Abstract
The exponential growth of data demands scalable infrastructures capable of indexing and searching rich content such as text, music, and images. A promising direction is to combine information retrieval with peer-to-peer technology for scalability, fault-tolerance, and low administration cost. One pioneering work along this direction is pSearch [32, 33]. pSearch places documents onto a peer-to-peer overlay network according to semantic vectors produced using Latent Semantic Indexing (LSI). The search cost for a query is reduced since documents related to the query are likely to be co-located on a small number of nodes. Unfortunately, because of its reliance on LSI, pSearch also inherits the limitations of LSI. (1) When the corpus is large and heterogeneous, LSI's retrieval quality is inferior to methods such as Okapi. (2) The Singular Value Decomposition (SVD) used in LSI is unscalable in terms of both memory consumption and computation time. This paper addresses the above limitations of LSI and makes the following contributions. (1) To reduce the cost of SVD, we reduce the size of its input matrix through document clustering and term selection. Our method retains the retrieval quality of LSI but is several orders of magnitude more efficient. (2) Through extensive experimentation, we found that proper normalization of semantic vectors for terms and documents improves recall by 76%. (3) To further improve retrieval quality, we use low-dimensional subvectors of semantic vectors to cluster documents in the overlay and then use Okapi to guide the search and document selection.
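The abstract names Okapi as the ranking method that guides search and document selection. As a rough illustration only, the sketch below scores a toy corpus with the widely used Okapi BM25 formula. The function name `bm25_scores`, the toy documents, the parameter defaults `k1=1.2` and `b=0.75`, and the non-negative IDF variant are all assumptions made for this example; the abstract does not say which Okapi variant or parameters the paper uses.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one Okapi BM25 score per document (hypothetical helper).

    Tokenization is plain lowercase whitespace splitting, an assumption
    chosen to keep the sketch short.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    query_terms = query.lower().split()
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            tf = term_freq.get(term, 0)
            if tf == 0:
                continue
            df = doc_freq[term]
            # Non-negative IDF variant (assumption): log(1 + (N - df + 0.5) / (df + 0.5)).
            idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
            # Term-frequency saturation with document-length normalization.
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


# Toy corpus (invented for illustration, not from the paper).
docs = [
    "latent semantic indexing of text",
    "peer to peer overlay network",
    "semantic vectors for text retrieval",
]
scores = bm25_scores("semantic text", docs)
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking, scores)
```

In a pSearch-style setting, a score of this kind would presumably be computed only over the documents held by the small set of nodes that the semantic vectors point to, rather than over the whole corpus.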
Year
2004
DOI
10.1145/1008992.1009014
Venue
SIGIR
Keywords
document clustering, large peer-to-peer system, cluster document, low administration cost, retrieval quality, term selection, overlay network, search cost, semantic vector, latent semantic indexing, psearch places document, document selection, algorithms, singular value decomposition, performance, clustering, exponential growth, dimensionality reduction, fault tolerant
Field
Singular value decomposition, Data mining, Dimensionality reduction, Information retrieval, Peer-to-peer, Computer science, Document clustering, Search engine indexing, Overlay, Overlay network, Scalability
DocType
Conference
ISBN
1-58113-881-4
Citations
45
PageRank
2.35
References
22
Authors
3
Name               Order  Citations  PageRank
Chunqiang Tang     1      1287       75.09
Sandhya Dwarkadas  2      3504       257.31
Zhichen Xu         3      1057       66.72