To execute distributed joins in parallel on compute clusters, systems partition and exchange data records between workers. With large datasets, workers spend a considerable amount of time transferring data over the network. When compute clusters are shared among multiple applications,workers must compete for network bandwidth with other applications. These variances in the available network bandwidth lead to network skew, which causes straggling workers to prolong the join completion time.
We describe SquirrelJoin, a distributed join processing technique that uses lazy partitioning to adapt to transient network skew in clusters. Workers maintain in-memory lazy partitions to withhold a subset of records, i.e. not sending them immediately to other workers for processing. Lazy partitions are then assigned dynamically to other workers based on network conditions: each worker takes periodic throughput measurements to estimate its completion time, and lazy partitions are allocated as to minimise the join completion time. We implement SquirrelJoin as part of the Apache Flink distributed dataflow framework and show that, under transient network contention in a shared compute cluster, SquirrelJoin speeds up join completion times by up to 2.9 with only a small, fixed overhead.