Title
TaintStream: fine-grained taint tracking for big data platforms through dynamic code translation
Abstract
Big data has become valuable property for enterprises and enabled various intelligent applications. Today, it is common to host data in big data platforms (e.g., Spark), where developers can submit scripts to process the original and intermediate data tables. Meanwhile, it is highly desirable to manage the data to comply with various privacy requirements. To enable flexible and automated privacy policy enforcement, we propose TaintStream, a fine-grained taint tracking framework for Spark-like big data platforms. TaintStream works by automatically injecting taint tracking logic into the data processing scripts, and the injected scripts are dynamically translated to maintain a taint tag for each cell during execution. The dynamic translation rules are carefully designed to guarantee non-interference in the original data operation. By defining different semantics of taint tags, TaintStream can enable various data management applications such as access control, data retention, and user data erasure. Our experiments on a self-crafted benchmark suite show that TaintStream is able to achieve accurate cell-level taint tracking with a precision of 93.0% and less than 15% overhead. We also demonstrate the usefulness of TaintStream through several real-world use cases of privacy policy enforcement.
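The idea of keeping a taint tag for each cell while leaving the original data operation unchanged can be illustrated with a minimal sketch. It is written in plain Python rather than Spark and is not TaintStream's implementation or its translation rules; the names `Cell`, `lift`, `with_column`, and `strip_tags` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List


@dataclass(frozen=True)
class Cell:
    """A table cell paired with its taint tags (hypothetical representation)."""
    value: object
    tags: FrozenSet[str] = frozenset()


Row = Dict[str, Cell]


def lift(fn: Callable[..., object]) -> Callable[..., Cell]:
    """Wrap a plain cell-level function so that it also propagates tags.

    The value is computed exactly as the original function would compute it;
    the output tag set is the union of the input tag sets.
    """
    def tainted(*cells: Cell) -> Cell:
        merged = frozenset().union(*(c.tags for c in cells))
        return Cell(fn(*(c.value for c in cells)), merged)
    return tainted


def with_column(rows: List[Row], name: str, fn: Callable[..., object],
                *inputs: str) -> List[Row]:
    """Add a derived column whose cells carry the tags of their inputs."""
    tainted_fn = lift(fn)
    return [{**row, name: tainted_fn(*(row[c] for c in inputs))} for row in rows]


def strip_tags(rows: List[Row]) -> List[Dict[str, object]]:
    """Drop the tags to recover the plain table the original script would see."""
    return [{k: c.value for k, c in row.items()} for row in rows]


rows = [
    {"name": Cell("alice", frozenset({"user:1"})), "city": Cell("Paris")},
    {"name": Cell("bob", frozenset({"user:2"})),
     "city": Cell("Oslo", frozenset({"location"}))},
]

# The derived "label" cell inherits tags from both of its input cells.
out = with_column(rows, "label", lambda n, c: f"{n}@{c}", "name", "city")
```

In this sketch, stripping the tags from `out` yields the same values the untracked operation would produce, which is the sense in which tag propagation does not interfere with the data. What a tag means (an access-control label, a retention deadline, a user identifier for erasure) is left to the policy that reads it.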
Year
2021
DOI
10.1145/3468264.3468532
Venue
Foundations of Software Engineering
Keywords
Taint tracking, big data platform, privacy compliance, GDPR
DocType
Conference
Citations
1
PageRank
0.36
References
20
Authors
7
Name            Order  Citations  PageRank
Chengxu Yang    1      1          0.70
Yuanchun Li     2      37         5.01
Mengwei Xu      3      24         3.53
Zhenpeng Chen   4      35         6.65
Yunxin Liu      5      694        54.18
Gang Huang      6      1223       110.80
Xuanzhe Liu     7      689        57.53