Title
Tiered data management system: Accelerating data processing on HPC systems
Abstract
The explosion of scientific data generated by large-scale simulations and advanced sensors makes scientific workflows more complex and more data-intensive. Supporting these data-intensive workflows on high-performance computing (HPC) systems presents new challenges in data management because of their scale, coordination behaviour, and overall complexity. In this paper, we propose the Tiered Data Management System (TDMS) to accelerate scientific workflows on HPC systems. TDMS prevents repetitive data movement by providing efficient data sharing on top of a tiered storage architecture. Customized data management for common workflow access patterns allows users to make full use of the advantages of the different storage tiers. An extended application interface, which supports user-defined data management strategies, strengthens the system's ability to handle diverse storage architectures and application scenarios. Moreover, we propose a data-aware task scheduling module that launches tasks on the compute nodes where the locality of the required data can be exploited most fully. We build a prototype and deploy it on a typical HPC system. We evaluate TDMS with realistic workflows, and the experiments show that it improves I/O performance and provides up to 1.54x speedup for data-intensive workflows compared with the Lustre file system.
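The abstract does not say how the data-aware scheduling module chooses a node. As a rough illustration of the general idea only, the sketch below assumes a simple greedy policy: among the free compute nodes, pick the one that already holds the most bytes of a task's input files. The function name `pick_node`, the `sizes` and `placement` maps, and the file and node names are all hypothetical and are not taken from TDMS.

```python
from typing import Dict, List, Set


def pick_node(
    required: List[str],
    sizes: Dict[str, int],
    placement: Dict[str, Set[str]],
    free_nodes: List[str],
) -> str:
    """Return the free node that holds the most bytes of the task's inputs.

    Assumed policy (not the paper's): greedy, locality-first placement.
    Ties go to the node listed first in free_nodes.
    """
    if not free_nodes:
        raise ValueError("no free compute nodes")

    def local_bytes(node: str) -> int:
        # Sum the sizes of the required files already resident on this node.
        return sum(
            sizes[name] for name in required if node in placement.get(name, set())
        )

    return max(free_nodes, key=local_bytes)


# Hypothetical example: three input files spread over three nodes.
sizes = {"a.dat": 100, "b.dat": 40, "c.dat": 10}
placement = {"a.dat": {"n1"}, "b.dat": {"n2"}, "c.dat": {"n2", "n3"}}
chosen = pick_node(["a.dat", "b.dat", "c.dat"], sizes, placement, ["n1", "n2", "n3"])
print(chosen)
```

A real scheduler would also need to weigh node load, storage-tier bandwidth, and the cost of moving the remaining data, none of which this sketch models.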
Year: 2019
DOI: 10.1016/j.future.2019.07.046
Venue: Future Generation Computer Systems
Keywords: HPC, Big data, Scientific workflows, Data management
Field: Data processing, Locality, Scheduling (computing), Computer science, Data sharing, Lustre (file system), Data management, Workflow, Distributed computing, Speedup
DocType: Journal
Volume: 101
ISSN: 0167-739X
Citations: 0
PageRank: 0.34
References: 0
Authors: 4
Name | Order | Citations | PageRank
Cheng Peng | 1 | 343 | 28.39
Yutong Lu | 2 | 307 | 53.61
Yunfei Du | 3 | 72 | 14.62
Zhiguang Chen | 4 | 79 | 18.83