Tianyu Li, MIT
Providing strong fault-tolerance guarantees for the modern cloud is difficult, as application developers must coordinate between independent stateful services and stateless, ephemeral compute while handling various failure-induced anomalies. In this talk, I will describe Composable Resilient Steps (CReSt), a new abstraction for resilient cloud applications. The crux of CReSt is fault-tolerant "steps" that allow participants to receive, process, and send messages as a single, uninterruptible atomic unit. Composability and reliability are achieved orthogonally by reusable CReSt implementations that, for example, leverage reliable message queues. Thus, CReSt application builders focus solely on translating application logic into steps, and infrastructure builders focus on efficient CReSt implementations. I will then discuss one such implementation, called DARQ (for Deduplicated Asynchronously Recoverable Queues). At its core, DARQ is a storage service that enforces CReSt semantics; developers attach ephemeral compute nodes to DARQ instances to implement stateful distributed components. Services built with DARQ are resilient by construction, and CReSt-compatible services naturally compose without loss of resiliency. For performance, DARQ features a novel speculative execution scheme that executes CReSt steps without waiting for message persistence, effectively eliding cloud persistence overheads. I will present prototype implementations of common cloud programming paradigms, such as stream processing and resilient workflows, built using DARQ to showcase its generality, along with benchmarking results to showcase its performance. Finally, I will outline how DARQ can be used as a foundation for the next-generation programming framework for disaggregated cloud applications.
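To make the idea of a "step" concrete, here is a minimal in-memory sketch in Python. It is not DARQ's API and says nothing about how DARQ is implemented: the class `ToyStepQueue`, its methods, and the `count_words` handler are all hypothetical names chosen for illustration, and persistence, distribution, and speculative execution are left out entirely. It shows only the two properties named in the abstract: a step consumes a message, updates state, and emits outputs all-or-nothing, and redelivered messages are deduplicated.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ToyStepQueue:
    """Hypothetical in-memory stand-in for a step-enforcing queue (not DARQ's API)."""
    inbox: deque = field(default_factory=deque)  # pending (msg_id, payload) pairs
    outbox: list = field(default_factory=list)   # outputs of committed steps
    state: dict = field(default_factory=dict)    # participant-local state
    seen: set = field(default_factory=set)       # ids of messages already accepted

    def enqueue(self, msg_id, payload):
        """Accept a message once; a redelivered duplicate is dropped."""
        if msg_id in self.seen:
            return False
        self.seen.add(msg_id)
        self.inbox.append((msg_id, payload))
        return True

    def run_step(self, handler):
        """Consume one message, update state, and emit outputs all-or-nothing."""
        if not self.inbox:
            return False
        _msg_id, payload = self.inbox[0]
        # Work on a shallow copy so a failing handler leaves no trace.
        # (Assumes state values are immutable, e.g. ints or strings.)
        new_state = dict(self.state)
        out = handler(new_state, payload)  # if this raises, nothing is committed
        # Commit point: consume, state update, and sends take effect together.
        self.inbox.popleft()
        self.state = new_state
        self.outbox.extend(out)
        return True


def count_words(state, payload):
    """Example handler: keep a running word count and emit the new total."""
    state["words"] = state.get("words", 0) + len(payload.split())
    return [("total", state["words"])]


q = ToyStepQueue()
q.enqueue("m1", "hello resilient cloud")
q.enqueue("m1", "hello resilient cloud")  # duplicate delivery, ignored
q.run_step(count_words)
```

In this toy, a handler that raises partway through leaves the message in the inbox and the state untouched, so the step can simply be retried. A real implementation would have to make the commit point durable, which is the part this sketch does not attempt.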
About the speaker
Tianyu Li is a fourth-year PhD student in MIT's data systems group, advised by Professor Sam Madden. He is mainly interested in distributed and cloud systems, applying insights and techniques from the database community to make modern cloud applications more fault-tolerant without sacrificing performance. Outside of this work, Tianyu has worked on highly concurrent key-value stores on the Microsoft FASTER project, and on transaction processing and self-driving databases on CMU's NoisePage project.