Crossbow | LSDS - Large-Scale Data & Systems Group, Imperial College London

Crossbow is a multi-GPU system for training deep learning models that lets users freely choose their preferred batch size, however small, while still scaling to multiple GPUs.

Crossbow utilises modern GPUs better than other systems by training multiple model replicas on the same GPU. When the batch size is small enough to leave GPU resources unused, Crossbow trains a second model replica, then a third, and so on, for as long as training throughput increases.
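The "add replicas as long as throughput increases" rule can be sketched as a simple search loop. This is an illustrative sketch only, not Crossbow's actual tuning code: the function name `tune_replicas`, the `measure_throughput` callback, and the `max_replicas` and `min_gain` parameters are all hypothetical.

```python
def tune_replicas(measure_throughput, max_replicas=8, min_gain=0.0):
    """Return the number of model replicas per GPU to use.

    measure_throughput(n) is a hypothetical callback that trains briefly
    with n replicas on one GPU and returns the observed throughput
    (for example, images per second).
    """
    best_count = 1
    best_throughput = measure_throughput(1)
    for count in range(2, max_replicas + 1):
        throughput = measure_throughput(count)
        # Stop as soon as one more replica no longer improves throughput
        # by more than the required relative gain.
        if throughput <= best_throughput * (1.0 + min_gain):
            break
        best_count, best_throughput = count, throughput
    return best_count


if __name__ == "__main__":
    # Made-up throughput figures, for illustration only.
    fake = {1: 100.0, 2: 180.0, 3: 210.0, 4: 205.0}
    print(tune_replicas(lambda n: fake[n], max_replicas=4))
```

The loop stops at the first replica count that fails to help, so it never explores larger counts after a regression. That is a deliberate simplification in this sketch.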

To synchronise many model replicas, Crossbow uses synchronous model averaging, which adjusts the trajectory of each individual replica based on the average of all replicas. With model averaging, the batch size does not increase linearly with the number of model replicas, as it would with synchronous SGD. This yields better statistical efficiency and avoids cumbersome hyper-parameter tuning when scaling training to a larger number of GPUs.
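The idea of model averaging can be illustrated with a toy example. The sketch below is a simplified, pure-Python version under stated assumptions, not Crossbow's implementation: each replica takes its own gradient step on its own small batch and is additionally pulled toward a shared average model, which in turn absorbs the sum of those corrections. The names `sma_step` and `run_demo`, the quadratic loss, the noise model, and the choice `alpha = 1 / num_replicas` are all assumptions made for illustration, and momentum on the average model is omitted.

```python
import random


def sma_step(replicas, center, grads, lr, alpha):
    """One synchronous step over all replicas.

    Each replica applies its own gradient and a correction toward the
    shared average model `center`; the center absorbs all corrections.
    """
    dim = len(center)
    total_correction = [0.0] * dim
    new_replicas = []
    for w, g in zip(replicas, grads):
        correction = [alpha * (w[i] - center[i]) for i in range(dim)]
        new_replicas.append(
            [w[i] - lr * g[i] - correction[i] for i in range(dim)]
        )
        for i in range(dim):
            total_correction[i] += correction[i]
    new_center = [center[i] + total_correction[i] for i in range(dim)]
    return new_replicas, new_center


def run_demo(num_replicas=4, steps=200, lr=0.1, noise=0.1, seed=0):
    """Minimise 0.5 * ||w - target||^2 with noisy per-replica gradients."""
    rng = random.Random(seed)
    target = [1.0, -2.0]
    alpha = 1.0 / num_replicas  # assumed correction strength
    center = [0.0, 0.0]
    replicas = [[0.0, 0.0] for _ in range(num_replicas)]
    for _ in range(steps):
        # Each replica sees its own noisy gradient, standing in for a
        # gradient computed on that replica's own small batch.
        grads = [
            [w[i] - target[i] + rng.gauss(0.0, noise) for i in range(2)]
            for w in replicas
        ]
        replicas, center = sma_step(replicas, center, grads, lr, alpha)
    return replicas, center


if __name__ == "__main__":
    replicas, center = run_demo()
    print("average model:", center)
```

In this sketch the gradients are never summed across replicas, so the batch size seen by each update stays at the per-replica batch size. Under synchronous SGD, k replicas with batch size b would instead act as a single update with batch size k × b.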

Crossbow is available on GitHub: https://github.com/lsds/Crossbow

Alexandros Koliousis (Graphcore, UK)

Luo Mai (University of Edinburgh)

Matthias Weidlich (Humboldt University, Germany)

Paolo Costa (Microsoft Research Cambridge)

Peter Pietzuch

Pijika Watcharapichat (Microsoft Research Cambridge)

Crossbow: Scalable Multi-GPU Deep Learning

Related Publications