With the availability of servers with many GPUs, scalability in terms of the number of GPUs becomes a paramount concern when training deep learning models. Common systems train using synchronous SGD: an input batch is partitioned across the GPUs, each GPU computes a partial gradient, and the partial gradients are then combined to update the model parameters. For many models, this introduces a scalability challenge: to keep all GPUs fully utilised, the batch size must be sufficiently large, but a large batch size slows down model convergence.
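To make the synchronous-SGD scheme concrete, the following is a minimal sketch in plain NumPy, with the GPUs simulated by loops: the global batch is split into one shard per worker, each worker computes a partial gradient on its shard, and the averaged gradient updates a single shared model. The names `loss_grad` and `sync_sgd_step`, and the least-squares objective, are illustrative assumptions, not part of any system described here.

```python
import numpy as np

def loss_grad(w, x, y):
    # Gradient of the least-squares loss 0.5 * ||x @ w - y||^2 w.r.t. w.
    return x.T @ (x @ w - y) / len(x)

def sync_sgd_step(w, batch_x, batch_y, num_gpus, lr):
    # Partition the input batch across the workers ("GPUs").
    xs = np.array_split(batch_x, num_gpus)
    ys = np.array_split(batch_y, num_gpus)
    # Each worker computes a partial gradient on its shard.
    grads = [loss_grad(w, xs[k], ys[k]) for k in range(num_gpus)]
    # Combine (average) the partial gradients and update the model.
    return w - lr * np.mean(grads, axis=0)

rng = np.random.default_rng(0)
w = np.zeros(8)
x, y = rng.normal(size=(64, 8)), rng.normal(size=64)
for _ in range(100):
    w = sync_sgd_step(w, x, y, num_gpus=4, lr=0.1)
```

The scalability tension is visible in the shard sizes: with a fixed per-GPU shard that is large enough to keep each device busy, adding GPUs forces the global batch `batch_x` to grow proportionally.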
This paper introduces CrossBow, a single-server multi-GPU deep learning system that trains multiple model replicas concurrently on each GPU, thereby avoiding under-utilisation of the GPUs even when the preferred batch size is small. CrossBow automatically tunes the number of replicas per GPU and employs a novel synchronisation scheme that eliminates dependencies among replicas. Our experiments show that CrossBow outperforms TensorFlow by 2.5x when training ResNet-32 on a 4-GPU server.
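The sketch below, again in plain NumPy, loosely illustrates the idea of training several replicas concurrently and keeping them consistent via model averaging rather than a single shared parameter set. It is not CrossBow's actual synchronisation algorithm; the function names and the correction factor `pull` are illustrative assumptions.

```python
import numpy as np

def sma_step(replicas, batches, lr, pull=0.1):
    # Each replica takes an independent SGD step on its own small batch,
    # so no replica waits on another's gradient.
    for k, (x, y) in enumerate(batches):
        grad = x.T @ (x @ replicas[k] - y) / len(x)
        replicas[k] -= lr * grad
    # Synchronise loosely: nudge every replica towards the average model
    # instead of forcing all replicas onto identical parameters.
    avg = np.mean(replicas, axis=0)
    for k in range(len(replicas)):
        replicas[k] += pull * (avg - replicas[k])
    return replicas

rng = np.random.default_rng(1)
x, y = rng.normal(size=(32, 8)), rng.normal(size=32)
replicas = [np.zeros(8) for _ in range(4)]  # e.g. several learners per server
for _ in range(100):
    small_batches = list(zip(np.array_split(x, 4), np.array_split(y, 4)))
    replicas = sma_step(replicas, small_batches, lr=0.1)
```

Because each replica trains on a small batch of its own, per-device utilisation can be raised by adding replicas rather than by inflating the batch size.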