Systems for training massive deep learning models (billions of parameters) typically require specialized "hyper-clusters": hundreds or thousands of GPUs wired with high-bandwidth interconnects such as NVLink and InfiniBand. Besides being expensive, this dependence on hyper-clusters and custom high-speed interconnects limits the size of such clusters, creating (a) scalability limits on job parallelism and (b) resource fragmentation across hyper-clusters.
This talk presents Varuna, a system for training massive deep learning models over commodity networking. Varuna makes thrifty use of networking resources and automatically configures the user's training job to make efficient use of any given set of resources. As a result, Varuna can leverage "low-priority" VMs that cost about 5x less than dedicated GPUs, significantly reducing the cost of training massive models. We demonstrate the efficacy of Varuna by training massive models, including a 200-billion-parameter model, on such 5x cheaper "spot VMs" while maintaining high training throughput.
Nitika is a PhD student at Cornell University, broadly interested in distributed systems and networking. Previously, she worked for two years at Microsoft Research India on large-scale distributed systems for machine learning. She recently won a Best Paper Award at EuroSys 2022 for Varuna.