Title
A characteristic study on failures of production distributed data-parallel programs
Abstract
SCOPE is used by thousands of developers from tens of product teams in Microsoft Bing for daily web-scale data processing, including index building, search ranking, and advertisement display. A SCOPE job is composed of declarative SQL-like queries and imperative C# user-defined functions (UDFs), which are executed in a pipeline by thousands of machines. Tens of thousands of SCOPE jobs are executed on Microsoft clusters per day, and some of them fail after a long execution time and thus waste tremendous resources. Reducing SCOPE failures would save significant resources. This paper presents a comprehensive characteristic study on 200 SCOPE failures/fixes and 50 SCOPE failures with debugging statistics from Microsoft Bing, investigating not only major failure types, failure sources, and fixes, but also current debugging practice. Our major findings include: (1) most of the failures (84.5%) are caused by defects in data processing rather than defects in code logic; (2) table-level failures (22.5%) are mainly caused by programmers' mistakes and frequent data-schema changes, while row-level failures (62%) are mainly caused by exceptional data; (3) 93% of fixes do not change data processing logic; (4) 8% of failures have their root cause outside the failure-exposing stage, making current debugging practice insufficient in these cases. Our study results provide valuable guidelines for future development of data-parallel programs. We believe that these guidelines are not limited to SCOPE, but can also be generalized to other similar data-parallel platforms.
Year
2013
DOI
10.1109/ICSE.2013.6606646
Venue
ICSE
Keywords
daily web-scale data processing,debugging statistic,programmer mistakes,team working,microsoft clusters,row-level failures,debugging statistics,code logic defects,failure types,parallel programming,data-parallel program,characteristic study,table-level failures,exceptional data,microsoft bing,frequent data,c language,product teams,microsoft cluster,distributed storage data,scope failures/fixes,search ranking,program debugging,failure-exposing stage,index building,declarative sql-like queries,udf,scope job,data processing defects,failure sources,debugging practice,software fault tolerance,current debugging practice,production distributed data-parallel program failures,frequent data-schema changes,code logic,scope failure,imperative c# user-defined functions,distributed databases,advertisement display,data processing logic,query processing,sql,production,data processing,data mining,data models,debugging,indexes
Field
SQL,Data processing,Software engineering,Computer science,Software fault tolerance,Database schema,Real-time computing,Distributed database,Root cause,Debugging,Algorithmic program debugging
DocType
Conference
ISBN
978-1-4673-3073-2
Citations
14
PageRank
0.66
References
19
Authors
7
Name            Order  Citations  PageRank
Sihan Li        1      159        6.35
Hucheng Zhou    2      145        9.51
Haoxiang Lin    3      181        9.29
Tian Xiao       4      17         1.36
Haibo Lin       5      213        13.44
Wei Lin         6      104        5.64
Tao Xie         7      5978       304.97