Distributed dataflow systems, such as Flink and Spark, provide high-level APIs to users, and they take care of the low-level details of executing the programs in a scalable way on a cluster of machines. These systems work well for simple programs, which are straightforward to express by just a few of the system-provided parallel operations. However, modern data analytics often demands the composition of larger programs, where
1) parallel operations are surrounded by control flow statements (e.g., in iterative algorithms, such as PageRank or K-means clustering), and/or
2) parallel operations are nested into each other.
In such cases, an unpleasant trade-off appears: we lose either performance or ease-of-use. I will talk about how we solved this trade-off for the case of control flow statements and nested parallelism.
Our system allows users to express control flow with easy-to-use, standard, imperative control flow constructs, and it compiles the program into a single dataflow job. This eliminates the job launch overhead from iteration steps, and allows for several loop optimizations.
For nested parallelism, we propose a compilation technique that flattens a nested program, i.e., creates an equivalent flat program where there is no nesting of parallel operations. Contrary to previous systems that perform flattening, we can even handle programs where there is an iterative algorithm at inner nesting levels.
Please email for a
Gábor E. Gévay is a PhD student at TU Berlin’s DIMA group, working at the intersection of databases and programming languages. His work on control flow handling in dataflow systems has won a best paper award at ICDE 2021. Prior to the PhD, he conducted research in computational number theory and in the artificial intelligence of board games.