This talk gives an overview of the technology inside Apache Flink. Apache is an open source project that develops a unified approach to batch- and streaming data analysis via distributed dataflows, rooted in the Stratosphere research project.
To the user, Flink offers rich APIs for batch- and streaming- data processing programs, with flexible windowing semantics, and seamless integration with Java and Scala programs. In addition, the Flink community is actively developing a series of libraries for the different use cases, such as Batch Machine Learning, Streaming Machine Learning, Graph Analysis, and high-level language queries.
The runtime of Apache Flink is a flexible stream processing system, which optimizations to efficiently handle batch programs (finite streams) and continuous streaming programs (infinite streams). Flink implements different recovery mechanisms (rollback/restart for finite streams, distributed snapshotting for infinite streams) and scheduling mechanisms, as well as a robust custom memory management to efficiently scale to data sets larger than main memory. Flink offers deep support for iterative programs via a restricted form of cyclic dataflows and can exploit stateful computation to support machine learning and graph analysis algorithms very efficiently.