Title
On the duality of data-intensive file system design: reconciling HDFS and PVFS
Abstract
Data-intensive applications fall into two computing styles: Internet services (cloud computing) and high-performance computing (HPC). In both categories, the underlying file system is a key component for scalable application performance. In this paper, we explore the similarities and differences between PVFS, a parallel file system used in HPC at large scale, and HDFS, the primary storage system used in cloud computing with Hadoop. We integrate PVFS into Hadoop and compare its performance to HDFS using a set of data-intensive computing benchmarks. We study how HDFS-specific optimizations can be matched using PVFS and how consistency, durability, and persistence tradeoffs made by these file systems affect application performance. We show how to embed multiple replicas into a PVFS file, including a mapping with a complete copy local to the writing client, to emulate HDFS's file layout policies. We also highlight implementation issues with HDFS's dependence on disk bandwidth and benefits from pipelined replication.
Year: 2011
DOI: 10.1145/2063384.2063474
Venue: SC
Keywords: parallel file system, file system, file layout policy, computing style, data-intensive computing benchmarks, underlying file system, high-performance computing, cloud computing, data-intensive file system design, pvfs file, application performance, reconciling hdfs, servers, layout, distributed databases, storage system, data intensive computing, semantics
Field: File system, Computer data storage, Computer science, Server, Parallel computing, Bandwidth (signal processing), Distributed database, Operating system, Scalability, Cloud computing, The Internet, Distributed computing
DocType: Conference
Citations: 23
PageRank: 1.34
References: 16
Authors (6)
Name                    Order  Citations  PageRank
Wittawat Tantisiriroj   1      108        4.98
Seung Woo Son           2      296        31.43
Swapnil Patil           3      306        18.05
Samuel J. Lang          4      23         1.34
Garth Gibson            5      257        13.77
Robert Ross             6      2717       173.13