Crossbow: Scalable Multi-GPU Deep Learning

Crossbow is a multi-GPU system for training deep learning models that lets users freely choose their preferred batch size, however small, while scaling to multiple GPUs.

Crossbow utilises modern GPUs better than other systems by training multiple model replicas on the same GPU. When the batch size is sufficiently small to leave GPU resources unused, Crossbow trains a second model replica, a third, etc., as long as training throughput increases.
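
The replica-selection idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Crossbow's actual tuning code: `choose_replica_count`, `measure_throughput`, `max_replicas` and the toy throughput figures are all hypothetical names and numbers introduced here.

```python
from typing import Callable


def choose_replica_count(
    measure_throughput: Callable[[int], float],
    max_replicas: int = 16,
) -> int:
    """Add model replicas one at a time while measured throughput keeps rising.

    `measure_throughput(n)` is a hypothetical callback that returns the
    training throughput (e.g. images/s) observed with n replicas on one GPU.
    """
    best_count = 1
    best_throughput = measure_throughput(1)
    for count in range(2, max_replicas + 1):
        throughput = measure_throughput(count)
        if throughput <= best_throughput:
            # The extra replica no longer helps, so stop growing.
            break
        best_count, best_throughput = count, throughput
    return best_count


def toy_throughput(replicas: int) -> float:
    # Made-up numbers: throughput improves up to 3 replicas, then drops.
    return {1: 100.0, 2: 180.0, 3: 210.0, 4: 205.0}.get(replicas, 200.0)


chosen = choose_replica_count(toy_throughput)
print(chosen)
```

With the toy measurements, the loop settles on three replicas because the fourth lowers throughput. A real system would have to measure throughput on live training tasks, which this sketch leaves out.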

To synchronise many model replicas, Crossbow uses synchronous model averaging to adjust the trajectory of each individual replica based on the average of all replicas. With model averaging, the batch size does not increase linearly with the number of model replicas, as it would with synchronous SGD. This yields better statistical efficiency without cumbersome hyper-parameter tuning when scaling training to a larger number of GPUs.
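
As a rough illustration of the averaging idea, the following Python sketch lets each replica take its own gradient step and then pulls it towards the average of all replicas. It is a simplified sketch, not Crossbow's implementation: the function names, the learning rate `lr` and the correction strength `alpha` are assumptions made for this example, and details of the real algorithm are omitted.

```python
from typing import List

Vector = List[float]


def average(replicas: List[Vector]) -> Vector:
    """Element-wise mean of all replica weight vectors."""
    n = len(replicas)
    return [sum(values) / n for values in zip(*replicas)]


def sma_step(
    replicas: List[Vector],
    gradients: List[Vector],
    lr: float = 0.1,
    alpha: float = 0.5,
) -> List[Vector]:
    """One synchronous step: a local gradient update plus a correction
    that moves each replica towards the current average model.

    `lr` and `alpha` are hypothetical hyper-parameters for this sketch.
    """
    center = average(replicas)
    updated = []
    for weights, grad in zip(replicas, gradients):
        # Each replica uses only its own (small-batch) gradient.
        new_weights = [
            w - lr * g - alpha * (w - c)
            for w, g, c in zip(weights, grad, center)
        ]
        updated.append(new_weights)
    return updated


# Two replicas with zero gradients, so only the correction acts.
replicas = [[0.0, 2.0], [2.0, 0.0]]
zero_grads = [[0.0, 0.0], [0.0, 0.0]]
pulled = sma_step(replicas, zero_grads, lr=0.1, alpha=0.5)
print(pulled)
```

In this sketch each replica computes its gradient from its own batch, so adding replicas does not enlarge the batch behind any single update. In synchronous SGD, by contrast, the gradients of all replicas are combined into one update over a proportionally larger batch.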

Crossbow is available on GitHub:

Alexandros Koliousis (Graphcore, UK)
Luo Mai (University of Edinburgh)
Matthias Weidlich (Humboldt University, Germany)
Paolo Costa (Microsoft Research Cambridge)
Pijika Watcharapichat (Microsoft Research Cambridge)

Related Publications

Alexandros Koliousis, Pijika Watcharapichat, Matthias Weidlich, Luo Mai, Paolo Costa, and Peter Pietzuch
45th International Conference on Very Large Data Bases (VLDB), 2019
Volume 11, Los Angeles, CA, USA
Luo Mai, Alexandros Koliousis, Guo Li, Andrei-Octavian Brabete, and Peter Pietzuch
ACM SIGOPS Operating Systems Review, 2019
Volume 53, Issue 1