Accessing and updating data sharded across distributed machines safely and quickly in the face of failures remains a challenging problem. Most prominently, applications that share state across nodes want their writes to become visible to others quickly, without giving up recoverability guarantees when a failure occurs. Current solutions that pair a fast cache with backing storage cannot easily support this use case. In this work, we design a distributed protocol called Distributed Prefix Recovery (DPR), which builds on a sharded cache-store architecture with single-key operations to provide cross-shard recoverability guarantees. With DPR, many clients can read and update shared state at sub-millisecond latency while receiving periodic prefix durability guarantees. On failure, DPR quickly restores the system to a prefix-consistent state with a novel non-blocking rollback scheme. In this talk, I will discuss the details of the DPR algorithm and briefly cover our ongoing work to make DPR more usable for a broader audience.
Please email for a
Tianyu is a third-year PhD student at MIT, advised by Sam Madden. His research focuses on developing new fault-tolerant schemes optimized for modern cloud workloads. Such schemes must have low overhead to support decomposing applications into fine-grained execution slices (e.g., on a serverless worker) without compromising guarantees or performance in the common case. Before joining MIT, he obtained a BS and MS from CMU, advised by Andy Pavlo.