Meta-Dataflows: Efficient Exploratory Dataflow Jobs | LSDS - Large-Scale Data & Systems Group, Imperial College London

Distributed dataflow systems such as Apache Spark and Flink are used to derive new insights from large datasets. While they efficiently execute concrete workflows expressed as dataflow graphs, they lack generic support for exploratory workflows: if a user is uncertain about the correct data processing pipeline to employ, e.g. in terms of data cleaning strategy or choice of model parameters, they must repeatedly submit modified jobs to the system. Current systems therefore miss out on optimisation opportunities for exploratory workflows, both in terms of efficient cluster scheduling and memory allocation. We describe meta-dataflows (MDFs), a new model that effectively expresses exploratory workflows and efficiently executes them on compute clusters. With MDFs, users specify a family of dataflow graphs using two new primitives: (a) an explore operator automatically explores unbound options in a dataflow graph; and (b) a choose operator assesses the result quality of explored dataflow branches, selecting a subset of results. We propose techniques for a dataflow system to execute MDFs efficiently: it can (i) avoid redundant computation when exploring options by reusing intermediate results and terminating underperforming branches; and it can (ii) allocate cluster memory for intermediate results more effectively by considering future data access patterns in the MDF. Our experimental evaluation with data mining and machine learning use cases shows that MDFs improve the runtime of exploratory workflows by up to 90% compared to sequential job execution.
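To make the two primitives concrete, the following is a minimal single-process sketch, not the MDF API or its implementation. The names `explore`, `choose`, `clean` and `clip_and_average`, along with the toy data and scoring function, are hypothetical and chosen only for illustration. The sketch shows one idea from the abstract: an intermediate result (the cleaned data) is computed once and reused by every explored branch, after which a quality score selects a subset of the results. Early termination of underperforming branches and cluster memory allocation are not modelled here.

```python
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

T = TypeVar("T")  # type of the shared intermediate result
P = TypeVar("P")  # type of an explored option (e.g. a parameter value)
R = TypeVar("R")  # type of a branch result


def explore(shared_input: T, options: Iterable[P],
            branch: Callable[[T, P], R]) -> Dict[P, R]:
    """Hypothetical 'explore': run one branch per unbound option,
    all over the same already-computed input."""
    return {option: branch(shared_input, option) for option in options}


def choose(results: Dict[P, R], score: Callable[[R], float],
           k: int = 1) -> List[Tuple[P, R]]:
    """Hypothetical 'choose': keep the k branches with the highest score."""
    ranked = sorted(results.items(), key=lambda item: score(item[1]),
                    reverse=True)
    return ranked[:k]


# Counter used only to show that the shared step runs once.
clean_calls = 0


def clean(raw):
    """Stand-in for an expensive data cleaning stage: drop missing values."""
    global clean_calls
    clean_calls += 1
    return [x for x in raw if x is not None]


def clip_and_average(data, threshold):
    """One explored branch: discard values above the threshold, then average."""
    kept = [x for x in data if x <= threshold]
    return sum(kept) / len(kept) if kept else 0.0


raw = [1.0, None, 2.0, 3.0, 100.0, None, 4.0]

# The intermediate result is computed once and reused by all branches,
# instead of being recomputed by three separately submitted jobs.
cleaned = clean(raw)

# Explore three candidate outlier thresholds.
results = explore(cleaned, [2.0, 5.0, 1000.0], clip_and_average)

# Toy quality measure (an assumption): prefer the mean closest to 2.5.
best = choose(results, score=lambda mean: -abs(mean - 2.5), k=1)
```

Submitting three independent jobs would instead run the cleaning stage three times. In this sketch the sharing is explicit in user code, whereas the abstract describes the dataflow system itself exploiting such reuse when it executes an MDF.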
