As datacenter applications grow in number and complexity,
datacenter-internal service latency requirements are dropping into the
microsecond range. Providing consistent microsecond-scale service
latencies at increasing datacenter utilization is difficult,
especially at scale, where failures are common. Operating system
functionality on the service critical path often incurs high,
millisecond-scale overhead and introduces even longer queueing delays
as utilization increases and during failover. My research aims to
dramatically lower service latencies under rising utilization by
co-designing hardware and operating system functionality to remove
these overheads from the critical path, even when failures are
common.
My recent focus has been on building low-latency, highly available
storage systems. The adoption of low-latency persistent memory modules (PMMs)
in datacenter servers upends the long-established model of remote
storage for distributed file systems. Instead, by colocating
computation with PMM storage, we can provide applications with much
lower IO and application failover latencies while offering strong
consistency. I present Assise, a new distributed file system based on
a persistent, replicated coherence protocol that manages client-local
PMM as a linearizable and crash-recoverable cache between applications
and slower (and possibly remote) storage. Assise maximizes locality
for all file IO by carrying out IO on process-local, socket-local, and
client-local PMM whenever possible. Assise minimizes coherence
overhead by maintaining consistency at IO operation granularity,
rather than at fixed block sizes. Assise improves IO latency,
throughput, and failover time by an order of magnitude over the
state of the art, while providing stronger consistency semantics. I
finish with an overview of further research in this space and an
outlook on the impending energy constraints of large-scale systems,
leading to a future research agenda in energy-resilient system design.
Please email for a Zoom link.