Crossbow is a multi-GPU system for training deep learning models that allows users to choose freely their preferred batch size, however small, while scaling to multiple GPUs.
Crossbow utilises modern GPUs better than other systems by training multiple model replicas on the same GPU. When the batch size is sufficiently small to leave GPU resources unused, Crossbow trains a second model replica, a third, etc., as long as training throughput increases.
To synchronise many model replicas, Crossbow uses synchronous model averaging to adjust the trajectory of each individual replica based on the average of all. With model averaging, the batch size does not increase linearly with the number of model replicas, as it would with synchronous SGD. This yields better statistical efficiency without cumbersome hyper-parameter tuning when trying to scale training to a larger number of GPUs.
Crossbow is available on GitHub: https://github.com/lsds/Crossbow