Stream processing has seen growing adoption in modern real-time data analytics applications, ranging from credit card fraud detection to clickstream analysis. The key challenge for these applications is to process a continuous stream of input records (e.g. credit card transactions, click and search term logs) at a high rate and with low latency.
The processing intent of many of the above applications can be expressed as a sliding window query. In this model, typical SQL operators, such as selection (filtering) and aggregation, are applied to a moving view over a stream of records, termed a window. As new records arrive, windows slide forward over time and query results are evaluated anew. For example, a sliding window query can find trending topics in web queries (Google Zeitgeist) or “liked” pages (Facebook Insights). An engine that executes such a query would have to ingest billions of user records in the course of a day; update results with every 1–2 seconds' worth of new data; and report them within one second.
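To make the model concrete, the following is a minimal sketch of a time-based sliding window count (in the spirit of a "trending topics" query); the function name, record layout, and the window size and slide values are illustrative assumptions, not the interface of any particular engine:

```python
def sliding_window_counts(records, window_size, slide):
    """Evaluate a sliding-window count query over a finite record batch.

    records: list of (timestamp, topic) pairs, ordered by timestamp.
    window_size, slide: in the same time units as the timestamps.
    Returns a list of (window_start, per-topic counts), one entry per
    window position; each window covers [start, start + window_size).
    """
    if not records:
        return []
    last_ts = records[-1][0]
    results = []
    start = 0
    while start <= last_ts:
        end = start + window_size
        # Re-evaluate the aggregate over the current view of the stream.
        counts = {}
        for ts, topic in records:
            if start <= ts < end:
                counts[topic] = counts.get(topic, 0) + 1
        results.append((start, counts))
        start += slide  # the window slides; results are computed anew
    return results
```

A real engine would of course evaluate windows incrementally over an unbounded stream rather than rescanning a batch, but the semantics, a view of width `window_size` advancing by `slide`, are the same.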
Currently, stream processing engines that run such complex analytics queries rely on data parallelism over large compute clusters, reportedly scaling out to 1000s of machines to achieve high throughput. Our work explores a different mode of data parallelism: what if, instead of scaling out, we had 1000s of cores within the confines of a single machine?
GPUs have been instrumental in the evolution of data analytics engines, allowing them to scale up rather than scale out. Their architecture is an especially good match for stream processing: computations on windows that are laid out linearly in memory should benefit greatly from the ample parallelism that GPUs offer.
The prospect of using GPUs for window-based stream processing is sufficiently attractive that we prototyped SABER, a novel framework, to study it. In SABER, the CPU and the GPGPU both contribute to the overall processing throughput of a query, in proportion to their capabilities, by working on different partitions of the input streams. We found strong evidence that (1) the size of stream partitions should be independent of the window definition of a query; and (2) a scheduling mechanism that permits either processor to opportunistically work on any such partition, taking into account their relative difference in performance, can maximise the utilisation of both processors.
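The intuition behind point (2) can be illustrated with a toy simulation; this is not SABER's actual scheduler, and the function name and fixed per-partition costs are hypothetical. Whichever processor becomes free first opportunistically grabs the next partition, so the faster processor naturally ends up handling proportionally more work:

```python
import heapq

def hybrid_schedule(num_tasks, cost_cpu, cost_gpu):
    """Simulate two processors opportunistically pulling fixed-size
    stream partitions (tasks) from a shared queue.

    cost_cpu, cost_gpu: time each processor needs per partition.
    Returns (makespan, partitions done by CPU, partitions done by GPU).
    """
    # Min-heap of (time when the processor becomes free, per-task cost, name).
    free = [(0.0, cost_cpu, 'cpu'), (0.0, cost_gpu, 'gpu')]
    heapq.heapify(free)
    done = {'cpu': 0, 'gpu': 0}
    makespan = 0.0
    for _ in range(num_tasks):
        # The earliest-free processor takes the next partition.
        t, cost, name = heapq.heappop(free)
        t += cost
        done[name] += 1
        makespan = max(makespan, t)
        heapq.heappush(free, (t, cost, name))
    return makespan, done['cpu'], done['gpu']
```

With `cost_cpu = 1.0` and `cost_gpu = 0.5`, for instance, the combined makespan approaches the additive ideal (a third of the CPU-only time), because neither processor ever idles while partitions remain, without any static split of work being fixed in advance.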
Remarkably, SABER achieves nearly additive throughput across the CPU and the GPGPU, and sub-second latency, regardless of the query it executes.