Scabbard: Single Node Fault-Tolerant Stream Processing | LSDS - Large-Scale Data & Systems Group, Imperial College London

Single-node multi-core stream processing engines (SPEs) can process hundreds of millions of tuples per second. Yet making them fault-tolerant with exactly-once semantics while retaining this performance is an open challenge: due to the limited I/O bandwidth of a single node, it becomes infeasible to persist all stream data and operator state during execution. Instead, single node SPEs must rely on upstream distributed systems, such as Apache Kafka, to recover stream data after failure, which necessitates complex cluster-based deployments. This lack of built-in fault tolerance features has hindered the adoption of single node SPEs.

We describe SCABBARD, the first single-node SPE that supports exactly-once fault-tolerance semantics despite limited local I/O bandwidth. SCABBARD achieves this by integrating persistence operations with the query workload. Within the operator graph, SCABBARD determines when to persist streams based on the selectivity of operators: by persisting streams after operators that discard data, it can substantially reduce the required I/O bandwidth. As part of the operator graph, SCABBARD also supports parallel persistence operations and uses markers to decide when to discard persisted data. Persisted data volume is further reduced using workload-specific compression: SCABBARD monitors stream statistics and dynamically generates computationally efficient compression operators. Our experiments show that SCABBARD can execute stream queries that process 300 million tuples per second while recovering from failures within 500ms.

Publications