Recently, there has been much interest in building streaming-processing applications on the Internet. Commonly-cited examples include real-time processing of financial data streams, continuous monitoring of Internet paths and system loads, and querying geographically diverse sensor networks. These applications typically involve querying, processing, and delivering real-time stream data from multiple distributed data sources, such as sensor networks, network monitors, or web feeds. Queries tend to be simple but frequent, and data almost always needs to be transmitted to a location remote from the sources. While any single streaming application might not stress the network, the aggregate load imposed by the myriad applications and thousands of concurrent queries is likely to overwhelm any existing infrastructure. Therefore we need new approaches for stream query optimisation to efficiently support this new class of applications.
A Stream-Based Overlay Network (SBON) is an infrastructure that manages and optimises stream queries from multiple applications. Rather than sending data to a centralized data warehouse for processing, an SBON leverages shared Internet resources using an overlay network and runs operators on behalf of many concurrent queries. The SBON uses a decentralised algorithms for efficient, large-scale query optimisation. By abstracting away the details of stream query optimization, an SBON greatly simplifies the development of Internet-wide stream-processing applications.
Operator Placement Problem
The SBON must perform an operator placement decision, creating a mapping of operators to physical overlay nodes. This mapping should make efficient use of network resources, for example, by filtering data close to the sources. There is an inherent tension here in that a placement strategy that optimises for the global good may harm the performance of individual queries.
Relaxation Placement Algorithm
The SBON uses a decentralised algorithm for network-aware operator placement called Relaxation placement. The idea behind Relaxation placement is to find a solution in two steps. First, an unpinned operator in a query is placed using a spring relaxation technique in a virtual metric latency space, as shown in the figure on the right. After that, the solution is mapped to actual physical overlay nodes.
To compute a latency space, each SBON node maintains a coordinate, such that the Euclidean distance between two coordinates is an estimate for communication latency. Performing query optimisation in this space has the benefit of naturally capturing latencies in the network without a large measurement overhead. It also means that query optimisation algorithms can perform optimisation decisions in a continuous mathematical space.
To find a good placement in the latency space, queries are modelled using springs. As shown in the figure on the left, the spring constant equals the data rate transferred over that link and the spring extension derives from the latency. Operators are then simulated as massless bodies between springs: Pinned operators have a fixed location, whereas unpinned ones can move freely. The function minimised by this approach is the datarate-latency product. This product is the amount of data in transit in the network and thus a measure for global network usage.
Distributed Query Optimization
Our approach for placing query operators in an overlay network can be generalised to obtain a scalable method for distributed stream query optimisation. In general, a cost space is a metric space that expresses costs for query optimisation decisions. Every SBON node maintains a coordinate in the cost space, such that the Euclidean distance between two coordinates approximates the cost of sending stream data between two nodes. A simple cost space would be the pure latency space mentioned above, which expresses costs in terms of network latency.
The figure on the right shows a more complex 3-dimensional cost space that captures latency (x- and y-axes) and CPU load (z-axis) in a single metric space. The distance in the x-y plane between two nodes gives an estimate of the communication latency. The height on the z-axis is proportional to the squared CPU load on a node. A query optimizer may use such a cost space to evaluate the costs of different query plans. For example, a query plan that places an operator at node a is a poor choice because that node is overloaded.
Publications and Talks
- Peter Pietzuch, Jonathan Ledlie, Jeffrey Shneidman, Mema Roussopoulos, Matt Welsh, and Margo Seltzer. "Network-Aware Operator Placement for Stream-Processing Systems". Proceedings of the 22nd International Conference on Data Engineering (ICDE'06), Atlanta, GA, April 2006.
- Peter Pietzuch, Jonathan Ledlie, and Margo Seltzer. "Supporting Network Coordinates on PlanetLab". Proceedings of the Second Workshop on Real, Large Distributed Systems (WORLDS'05), San Francisco, CA, December 2005.
- Jeff Shneidman, Peter Pietzuch, Matt Welsh, Margo Seltzer, and Mema Roussopoulos. "A Cost-Space Approach to Distributed Query Optimization in Stream Based Overlays". Proceedings of the 1st IEEE International Workshop on Networking Meets Databases (NetDB'05), Tokyo, Japan, April 2005.
(PDF version, PDF presentation).
- Peter Pietzuch, Jeff Shneidman, Jonathan Ledlie, Matt Welsh, Margo Seltzer, and Mema Roussopoulos. "Evaluating DHT-Based Service Placement for Stream-Based Overlays". Proceedings of the 4th International Workshop on Peer-to-Peer Systems (IPTPS'05), Ithaca, New York, February 2005.
- Peter Pietzuch "Path Optimization in Stream-Based Overlay Networks". Intel Research Berkeley Seminar, November 2004.
- Peter Pietzuch, Jeffrey Shneidman, Mema Roussopoulos, Margo Seltzer, Matt Welsh "Path Optimization in Stream-Based Overlay Networks". Harvard Technical Report TR-26-04, Harvard University, November 2004.
- Peter Pietzuch "Hourglass: A Stream-Based Overlay Network for Sensor Applications". Poster, Harvard Industrial Partnership (HIP'04), Harvard University, October 2004.
- Jeffrey Shneidman, Peter Pietzuch, Jonathan Ledlie, Mema Roussopoulos, Margo Seltzer, Matt Welsh "Hourglass: An Infrastructure for Connecting Sensor Networks and Applications". Harvard Technical Report TR-21-04, Harvard University, September 2004.
- Jonathan Ledlie, Jeffrey Shneidman, Matt Welsh, Mema Roussopoulos, Margo Seltzer "Open Problems in Data Collection Networks". SIGOPS European Workshop, Leuven, Belgium, September 2004.
This work is part of the Hourglass Project.
This material is based upon work supported by the National Science Foundation under Grant No. 0330244. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.