(EPSRC Grant EP/F035217/1 July 2008 to June 2011)
Real-time stream data plays an increasingly important role on the Internet. One cause is the proliferation of geographically-distributed stream data sources, such as sensor networks, scientific instruments, pervasive computing environments and web feeds, connected to the Internet. Potentially millions of users world-wide want to take advantage of this data and therefore require a convenient way to process real-time streams at a global scale through applications that perform Internet-scale stream processing (ISSP). Much as relational queries do in a DBMS, stream-processing systems allow users to access and manipulate distributed data streams through declarative queries. However, the scale of an Internet-wide system poses substantial challenges for providing a dependable service. Any such system must gracefully handle the failure of network links and processing hosts while managing a large pool of CPU and network resources.
For example, astronomers want to detect transient sky events, such as gamma-ray bursts, in real-time. To detect such events, they must correlate real-time image streams from geographically-distributed radio telescopes. These events last only minutes and, after an event has been detected, instruments need to be re-aligned to focus on ongoing occurrences. The left figure below shows an ISSP system executing a query that takes images from radio telescopes, processes them in real-time and delivers data about transient anomalies to two astronomers. The processing is done by a distributed set of hosts (or data centres). The logical structure of the query implementing the application is shown on the right. The system must ensure that image data is transported reliably between processing sites. However, some data loss is acceptable, as long as no transient sky events are missed as a consequence.
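To make the logical query structure concrete, the following is a minimal sketch (not the DISSP implementation) of such a query in Python: a union operator merges tuples from several hypothetical telescope feeds, a tumbling-window operator groups them by time, and a detection operator flags windows whose combined signal exceeds a threshold, standing in for transient detection. All names and parameters here are illustrative assumptions.

```python
# Illustrative sketch of a logical stream query: merge sources,
# window by time, detect windows with anomalously high signal.
from collections import defaultdict

def merge_streams(*streams):
    """Union operator: interleave tuples from all source streams."""
    for stream in streams:
        for tup in stream:
            yield tup

def window_by_time(tuples, width):
    """Group (timestamp, value) tuples into tumbling windows of `width`."""
    windows = defaultdict(list)
    for ts, value in tuples:
        windows[ts // width].append(value)
    return windows

def detect_transients(windows, threshold):
    """Keep only the windows whose summed signal exceeds `threshold`."""
    return {w: vals for w, vals in windows.items() if sum(vals) > threshold}

# Two hypothetical telescope feeds: (timestamp, intensity) tuples.
telescope_a = [(0, 1.0), (3, 1.2), (11, 5.0)]
telescope_b = [(1, 0.9), (4, 1.1), (12, 4.8)]

merged = merge_streams(telescope_a, telescope_b)
windows = window_by_time(merged, width=10)
events = detect_transients(windows, threshold=6.0)
print(sorted(events))  # -> [1]: only the window covering t in [10, 20) carries a burst
```

In a real ISSP deployment each operator would run on a different host, and the system, rather than the user, would decide the placement of operators and the routing of streams between them.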
The DISSP project investigates how to build dependable Internet-scale stream-processing systems for interconnecting tomorrow's pervasive sensor systems and global scientific experiments. We argue that Internet-scale stream processing needs new models for achieving dependability. Achieving dependability in this context is a significant challenge for several reasons: (1) failure will be the common case in the system: due to its size, a fraction of Internet paths and hosts will be unavailable at any time; (2) the real-time nature of the data means that there is little time for recovery from failure; (3) a shared infrastructure, such as an ISSP system, will experience high utilisation, so the additional resource demand during recovery can overload the system. The traditional wisdom of substantially over-provisioning a system to compensate for failure is infeasible in such a shared, federated platform.
Therefore we believe that we need to depart from the hard dependability guarantees of traditional DBMSs and today's stream-processing systems. Ensuring no tuple loss at all times may be feasible within a single data centre, but we cannot hope to achieve this at Internet scale. Instead, we explore dependability guarantees that are driven by application requirements. Many sensing applications can cope with a controlled degradation of result quality. While result quality is reduced, the system provides constant feedback to users on the achieved level of service. Feedback is expressed in a domain-specific way, e.g., by notifying a scientific user about the reduction in detection confidence of events of interest. This feedback also drives an adaptive fault-tolerance mechanism, allowing the DISSP system to strategise about resource allocation so as to minimise the reduction in service quality for as many users as possible.
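One simple way to picture such quality feedback is sketched below. This is an assumed metric, not the DISSP reliability model: it reports, per result window, the fraction of expected source tuples that actually contributed, so users can see the degradation caused by unmasked failures. The function names and the textual format are illustrative.

```python
# Illustrative sketch: expose the quality of a stream-query result
# as the fraction of expected source tuples that actually arrived.

def window_quality(received, expected):
    """Fraction of expected tuples that contributed to this window."""
    return 0.0 if expected == 0 else received / expected

def feedback(received, expected):
    """Render the quality level as a domain-readable message."""
    q = window_quality(received, expected)
    return f"result computed from {q:.0%} of source data"

print(feedback(received=45, expected=60))  # -> "result computed from 75% of source data"
```

A domain-specific front end could then translate such a figure into, for instance, a reduced detection confidence for transient events, and the fault-tolerance mechanism could use it to prioritise recovery for the queries whose quality has degraded most.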
The aims of the DISSP project are to:
- Investigate and develop a novel reliability model for DISSP that includes user-perceived quality of query results and provides feedback to users on quality degradation due to unmaskable network and host failures.
- Provide adaptive fault-tolerance mechanisms for DISSP systems that select appropriate strategies for maximising result quality over the lifetime of potentially long-running queries (e.g. over several months).
- Design, implement and evaluate a scalable prototype system for DISSP using controlled experiments on the Emulab network testbed.
- Deploy an open, global and shared platform for DISSP as a public service on the PlanetLab research network, thus facilitating and encouraging the use of DISSP across research communities.
The list of DISSP-related publications can be found in the corresponding section of this website.
- Dr Peter Pietzuch, Imperial College London
- Dr Eva Kalyvianaki, Imperial College London
- Marco Fiscato, Imperial College London