SEEP: Stateful Big Data Processing

As users of big data applications expect fresh results, we witness a new breed of stream processing systems that are designed to scale to large numbers of cloud-hosted machines. Such systems face new challenges:

  • to benefit from the pay-as-you-go model of cloud computing, they must scale out on demand, acquiring additional virtual machines (VMs) and parallelising operators when the workload increases;
  • failures are common with deployments on hundreds of VMs, so systems must be fault-tolerant with fast recovery times yet low per-machine overheads.

An open question is how to achieve these two goals when stream queries include stateful operators, which must be scaled out and recovered without affecting query results.

The SEEP project explores large-scale stream data processing in cloud architectures. It investigates new data management and system mechanisms that enable elastic data processing, provisioning more resources when computing demand increases and decommissioning them when they are no longer necessary. Elasticity matters because cloud resources are paid for by usage; fault tolerance is equally important. After a failure, the system must recover quickly enough to keep up with the incoming data, and recovery must be efficient because failures are common in large clusters.

The main challenge is to provide both properties for stateful operators, which broaden the range of domains and applications that the system can support. We propose an integrated approach to scale out stateful operators while maintaining fault tolerance. In SEEP, we make the internal state of operators explicit and expose it to the system, so that the system can scale out operators and recover them after failures. This state externalisation allows us to build more complex models and applications.
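To illustrate what explicit operator state can look like, the following Python sketch shows a word-counting operator whose state is an ordinary dictionary that can be checkpointed, restored and partitioned from outside the operator. It is a minimal sketch under stated assumptions, not the SEEP implementation: the names `WordCounter`, `checkpoint`, `restore`, `owner` and `scale_out` are hypothetical, and partitioning the state by a hash of the key is an assumption chosen for simplicity.

```python
import copy
import zlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class WordCounter:
    """Hypothetical stateful operator; its state is an explicit key -> count map."""
    state: Dict[str, int] = field(default_factory=dict)

    def process(self, word: str) -> int:
        # Update the explicit state and emit the running count for this key.
        self.state[word] = self.state.get(word, 0) + 1
        return self.state[word]

    def checkpoint(self) -> Dict[str, int]:
        # Hand an independent copy of the state to the surrounding system.
        return copy.deepcopy(self.state)

    @classmethod
    def restore(cls, snapshot: Dict[str, int]) -> "WordCounter":
        # Rebuild an operator instance from a previously taken checkpoint.
        return cls(state=copy.deepcopy(snapshot))


def owner(key: str, n: int) -> int:
    # Stable hash: Python's built-in hash() for str is randomised per process.
    return zlib.crc32(key.encode("utf-8")) % n


def scale_out(op: WordCounter, n: int) -> List[WordCounter]:
    # Split the checkpointed state across n new instances by key (assumed scheme).
    parts = [WordCounter() for _ in range(n)]
    for key, count in op.checkpoint().items():
        parts[owner(key, n)].state[key] = count
    return parts


# Usage: build up state, then recover it and scale it out.
op = WordCounter()
for w in ["a", "b", "a", "c", "a"]:
    op.process(w)

snapshot = op.checkpoint()
recovered = WordCounter.restore(snapshot)   # recovery after a simulated failure
parts = scale_out(op, 2)                    # two partitions of the same state

# Tuples must be routed with the same function that partitioned the state.
next_count = parts[owner("a", 2)].process("a")
```

Because the state is visible to the system rather than hidden inside the operator, the same checkpoint serves both purposes in this sketch: restoring it on a new instance recovers a failed operator, and splitting it by key scales the operator out without changing the counts it produces.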

Implementation and source code

The SEEP prototype implementation is available on GitHub.

Funder

EPSRC/BAE Systems (2011-2014)

Team

Eva Kalyvianaki (City University London, UK)
Matteo Migliavacca (University of Kent, UK)
Matthias Weidlich (Humboldt University, Germany)
Panagiotis Garefalakis (Cloudera, US)
Victoria Lopez Morales (Sainsbury's, UK)

Related Publications

Alexandros Koliousis, Matthias Weidlich, Raul Castro Fernandez, Paolo Costa, Alexander L. Wolf, and Peter Pietzuch
ACM International Conference on Management of Data (SIGMOD), 2016
San Francisco, CA, USA

Raul Castro Fernandez, Panagiotis Garefalakis, and Peter Pietzuch
32nd IEEE International Conference on Data Engineering (ICDE), 2016
Helsinki, Finland

Raul Castro Fernandez, Peter Pietzuch, Joel Koshy, Jay Kreps, Dong Lin, Neha Narkhede, Jun Rao, Chris Riccomini, and Guozhang Wang
Biennial Conference on Innovative Data Systems Research (CIDR), 2015
Asilomar, CA, USA

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch
USENIX Annual Technical Conference (ATC), 2014
Philadelphia, PA, USA

Raul Castro Fernandez, Matthias Weidlich, Peter Pietzuch, and Avigdor Gal
8th ACM International Conference on Distributed Event Based Systems (DEBS), 2014
Mumbai, India

Raul Castro Fernandez, Matteo Migliavacca, Evangelia Kalyvianaki, and Peter Pietzuch
ACM International Conference on Management of Data (SIGMOD), 2013
New York, NY, USA

Matteo Migliavacca, David M. Eyers, Jean Bacon, Ioannis Papagiannis, Brian Shand, and Peter Pietzuch
ACM/IFIP/USENIX 11th International Middleware Conference (Middleware), 2010
Bangalore, India