Opening the Network Black Box for Data-Intensive Computing
Lukas Rupprecht, Imperial College London
The ongoing explosion of data calls for new technologies that can efficiently extract value from ever-increasing data volumes. Data-parallel frameworks such as Apache Hadoop or Spark meet this demand with a shared-nothing architecture, which allows them to scale processing to clusters of thousands of nodes. As data and cluster sizes grow, queries involve larger data transfers and more complex traffic patterns. Despite this heavy use of the network, applications still treat it as a black box. Because the specific demands that different queries place on the network cannot be conveyed to the networking layer, the network is used suboptimally and query performance suffers.
In this talk, I introduce three main design principles to improve the interaction of data-parallel frameworks with the network: awareness, adaptivity, and avoidance. First, I propose to make data-parallel frameworks *aware* of the network by exploiting properties of the networking fabric, e.g., its processing capacities or topology, to reduce the amount of traffic generated during queries. I then extend one framework to monitor network metrics and use this information to *adapt* to network access skew, which can occur when multiple applications compete for network resources in large data centers. Finally, I show that unnecessary network transfers can be *avoided* by carefully optimizing how data-parallel frameworks perform I/O.