Due to their scalability and low cost, object-based storage systems are an attractive storage solution and widely deployed. To gain valuable insight from the data residing in object storage but avoid expensive copying to a distributed filesystem (e.g. HDFS), it would be natural to directly use them as a storage backend for data-parallel analytics frameworks such as Spark or MapReduce. Unfortunately, executing data-parallel frameworks on object storage exhibits severe performance problems, reducing average job completion times by up to 6.5×.
We identify the two most severe performance problems when running data-parallel frameworks on the OpenStack Swift object storage system in comparison to the HDFS distributed filesystem: (i) the fixed mapping of object names to storage nodes prevents local writes and adds delay when objects are renamed; (ii) the coarser granularity of objects compared to blocks reduces data locality during reads. We propose the SwiftAnalytics object storage system to address them: (i) it uses locality-aware writes to control an object’s location and eliminate unnecessary I/O related to renames during job completion, speeding up analytics jobs by up to 5.1×; (ii) it transparently chunks objects into smaller sized parts to improve data-locality, leading to up to 3.4× faster reads.