Title
Characterization of real workloads of web search engines
Abstract
Search is the most heavily used web application in the world and is still growing at an extraordinary rate. Understanding the behavior of web search engines is therefore becoming increasingly important to the design and deployment of data center systems that host them. In this paper, we study three search query traces collected from real-world web search engines operated by three different search service providers. The first part of our study uncovers the patterns hidden in the query traces by analyzing the variations, frequencies, and locality of query requests. Our analysis reveals that, contrary to some previous studies, real-world query traces do not follow well-defined probability models such as the Poisson distribution and the log-normal distribution. The second part of our study deploys the real query traces, together with three synthetic traces generated using probability models proposed by other researchers, on a Nutch-based search engine. The performance data measured from these deployments further confirm that synthetic traces do not accurately reflect the real traces. We develop an evaluation tool that collects performance metrics online with negligible overhead. The performance metrics include average response time, CPU utilization, disk accesses, and cycles per instruction. The third part of our study compares the search engine with representative benchmarks, namely Gridmix, SPECweb2005, TPC-C, SPECCPU2006, and HPCC, with respect to basic architecture-level characteristics and performance metrics such as instruction mix, processor pipeline stall breakdown, memory access latency, and disk accesses. The experimental results show that web search engines have a high percentage of load/store instructions but good cache and memory performance. We hope the results presented in this paper will help system designers gain insight into optimizing systems that host search engines.
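The abstract does not say how the traces were tested against the Poisson model. As an illustration only, and not the authors' method, one common first check is the index of dispersion (variance-to-mean ratio) of per-window arrival counts, which is close to 1 for a Poisson process. The sketch below assumes a trace is just a list of arrival timestamps in seconds; all function names and the two synthetic generators are hypothetical.

```python
import random
from statistics import mean, pvariance


def counts_per_window(timestamps, window):
    """Count arrivals in consecutive fixed-width windows, starting at the first arrival."""
    if not timestamps:
        return []
    start = min(timestamps)
    n_windows = int((max(timestamps) - start) // window) + 1
    counts = [0] * n_windows
    for t in timestamps:
        counts[int((t - start) // window)] += 1
    return counts


def dispersion_index(counts):
    """Variance-to-mean ratio of the counts; roughly 1 for Poisson arrivals."""
    m = mean(counts)
    return pvariance(counts) / m if m > 0 else float("nan")


def poisson_arrivals(rate, duration, rng):
    """Synthetic trace: exponential inter-arrival times at a constant rate."""
    t, out = 0.0, []
    while True:
        t += rng.expovariate(rate)
        if t >= duration:
            return out
        out.append(t)


def bursty_arrivals(rate_low, rate_high, duration, period, rng):
    """Synthetic trace whose rate alternates between high and low every `period` seconds."""
    out = []
    for k in range(int(duration // period)):
        rate = rate_high if k % 2 == 0 else rate_low
        out.extend(k * period + t for t in poisson_arrivals(rate, period, rng))
    return out


if __name__ == "__main__":
    rng = random.Random(42)
    steady = poisson_arrivals(20.0, 2000.0, rng)
    bursty = bursty_arrivals(5.0, 50.0, 2000.0, 10.0, rng)
    print("steady:", dispersion_index(counts_per_window(steady, 1.0)))
    print("bursty:", dispersion_index(counts_per_window(bursty, 1.0)))
```

A ratio far above 1 indicates burstier traffic than a Poisson model predicts. A ratio near 1 is necessary but not sufficient, so a real analysis would follow up with a proper goodness-of-fit test.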
Year: 2011
DOI: 10.1109/IISWC.2011.6114193
Venue: IISWC
Keywords: poisson distribution, architecture-level characteristics, measured performance data, synthetic trace, different search service provider, search query, probability model, web search engines, search service providers, search engine, web search engine, real world web search, log-normal distribution, search query traces, internet, processor pipeline stall breakdown, real workloads, memory performance, memory access latency, nutch based search engine, web application, disk access, search engines, performance metrics, instruction mix, data center system, cpu utilization, log normal distribution, service provider, system design, servers, data center, cache memory, cycles per instruction, engines, benchmark testing
Field: Web search query, Search engine, Query expansion, CPU time, Computer science, Cache, Parallel computing, Web query classification, Real-time computing, Search analytics, Web application, Database
DocType: Conference
ISBN: 978-1-4577-2062-8
Citations: 13
PageRank: 1.24
References: 10
Authors: 8
Name | Order | Citations | PageRank
Huafeng Xi | 1 | 13 | 1.24
Jianfeng Zhan | 2 | 767 | 62.86
Zhen Jia | 3 | 338 | 17.82
Xuehai Hong | 4 | 17 | 2.04
Lei Wang | 5 | 577 | 46.85
Lixin Zhang | 6 | 571 | 45.96
SUN Ning-Hui | 7 | 1268 | 97.37
Gang Lu | 8 | 311 | 12.40