Pollux: towards scalable distributed real-time search on microblogs - Citegraph

Paper Info

Title
Pollux: towards scalable distributed real-time search on microblogs

Abstract
The last few years have witnessed a meteoric rise of microblogging platforms, such as Twitter and Tumblr. The sheer volume of the microblog data and its highly dynamic nature present unique technical challenges for the platforms that provide search services. In particular, the search service must provide real-time response to queries, and continuously update the results as new microblogs are posted. Conventional approaches either cannot keep up with the high update rate, or cannot scale well to handle the large volume of data. We propose Pollux, a system that provides distributed real-time indexing and search service on microblogs. It adopts the distributed stream processing paradigm advocated by the recently developed platforms that are designed for real-time processing of large volume of data, such as Apache S4 and Twitter Storm. Although those open-source platforms have found successful applications in production environments, they lack some critical features required for real-time search. In particular: (1) they only implement partial fault tolerance, and do not provide lossless recovery in the event of a node failure, and (2) they do not have a facility for storing global data, which is necessary in efficiently ranking search results. Addressing those problems, Pollux extends current platforms in two important ways. First, we propose a failover strategy that can ensure high system availability and no data/state loss in the event of a node failure. Second, Pollux adds a global storage facility that supports convenient, efficient, and reliable data storage for shared data. We describe how to apply Pollux to the task of real-time search. We implement Pollux based on Apache S4, and show through extensive experiments on a Twitter dataset that the proposed solutions are effective, and Pollux can achieve excellent scalability.

Year	DOI	Venue
2013	10.1145/2452376.2452416	EDBT
Keywords	Field	DocType
microblog data,real-time search,ranking search result,global data,apache s4,shared data,reliable data storage,node failure,large volume,search service,distributed processing,fault tolerance,microblog	Failover,Data mining,Social media,Ranking,Computer science,Data stream,Microblogging,Search engine indexing,Fault tolerance,Stream processing,Database,Scalability	Conference
Citations	PageRank	References
3	0.38	23
Authors
3

Authors (3 rows)

Cited by (3 rows)

References (23 rows)

Name	Order	Citations	PageRank
Liwei Lin	1	122	28.76
Xiaohui Yu	2	869	64.75
Nick Koudas	3	6424	566.00

1