In recent years, Kubernetes has been widely adopted as an orchestration platform for automating the deployment, scaling, and management of containerized applications. The recently introduced native integration of container checkpointing in Kubernetes enables dynamic relocation, scale-out, and load balancing of microservices, as well as fast startup times, forensic analysis, and fault tolerance for stateful applications.
In this talk, we discuss some of the challenges of checkpointing long-running stateful applications and the performance trade-offs of periodic checkpointing and rollback recovery. We also discuss how image-less checkpoint streaming approaches can address some of these challenges, and outline future research directions.