Title
A high-throughput, scalable solution for calculating frequent routes and profitability of New York taxis
Abstract
Processing complex queries on unbounded event streams in real time is a challenge for many data processing systems. These systems are expected to process data with low latency, so that events are generated in real time, and at high throughput, so that less hardware is required. In this regard, Grand Challenge 2015 [6] focuses on evaluating two queries (frequent routes and profitable cells) in real time with low latency and high throughput. These queries involve processing windows of thousands of records. Firstly, such processing demands efficient data structures and algorithms to minimize the processing overhead. Secondly, the system should partition the data so that it can be evaluated in parallel, which makes the solution scalable. In this paper, we present a set of data structures that we designed to evaluate the aforementioned queries with O(log n) time complexity, and a data partitioning technique to evaluate them in parallel. We then evaluate our solution on a single machine as well as in a distributed setting on a commodity cluster of machines connected over a 1 Gbps LAN. We were able to process the frequent routes query on the 173 million trips dataset within 5 minutes with a latency of less than 4 milliseconds, and the profitable cells query on the same dataset within 11 minutes with a latency of less than 5 milliseconds.
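The abstract does not describe the data structures themselves, so the following is only an illustrative sketch of how a frequent routes query over a sliding window can be served with logarithmic heap operations. It is not the authors' design. The class name `FrequentRoutes`, the 30-minute default window, and the string route labels are all assumptions made for the example; the sketch uses a binary heap with lazy invalidation of stale entries, which is a swapped-in, commonly used technique.

```python
import heapq
from collections import defaultdict, deque


class FrequentRoutes:
    """Hypothetical sliding-window route counter (not the paper's structure)."""

    def __init__(self, window_seconds=1800):  # assumed 30-minute window
        self.window = window_seconds
        self.events = deque()           # (timestamp, route) in arrival order
        self.counts = defaultdict(int)  # route -> trips currently in the window
        self.heap = []                  # (-count, route) entries; some may be stale

    def _change(self, route, delta):
        # Update the count and push a fresh heap entry: O(log n) per change.
        self.counts[route] += delta
        count = self.counts[route]
        if count == 0:
            del self.counts[route]
        else:
            heapq.heappush(self.heap, (-count, route))

    def add(self, timestamp, route):
        # Expire trips that have fallen out of the window, then record the new trip.
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, old_route = self.events.popleft()
            self._change(old_route, -1)
        self.events.append((timestamp, route))
        self._change(route, +1)

    def top(self, k):
        # Pop entries until k valid ones are found; stale entries are discarded.
        result, seen, keep = [], set(), []
        while self.heap and len(result) < k:
            neg_count, route = heapq.heappop(self.heap)
            if route in seen or self.counts.get(route) != -neg_count:
                continue  # outdated count or duplicate entry for this route
            seen.add(route)
            result.append((route, -neg_count))
            keep.append((neg_count, route))
        for entry in keep:  # restore the valid entries for later queries
            heapq.heappush(self.heap, entry)
        return result
```

Each trip costs a constant number of heap pushes, and stale entries are dropped when a query meets them, so the cost per event is logarithmic in the heap size in the amortized sense. A partitioned deployment could, under the same assumptions, run one such instance per key range and merge the per-partition top lists.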
Year: 2015
DOI: 10.1145/2675743.2772589
Venue: DEBS
Field: Data structure, Latency (engineering), Computer science, Data processing system, Real-time computing, Event stream processing, Throughput, Time complexity, Scalability, Distributed computing
DocType: Conference
Citations: 1
PageRank: 0.41
References: 7
Authors: 2
Name                   Order   Citations   PageRank
Amila Suriarachchi     1       1           0.41
Shrideep Pallickara    2       837         92.72