Ako: Decentralised Deep Learning

ako (verb): to learn, in Māori

Distributed systems for training deep neural networks (DNNs) with large amounts of data have vastly improved the accuracy of machine learning models for image and speech recognition. DNN systems scale to large cluster deployments by having worker nodes train many model replicas in parallel; to ensure model convergence, parameter servers periodically synchronise the replicas. This raises the challenge of how to split resources between workers and parameter servers so that the cluster CPU and network resources are fully utilised without introducing bottlenecks. In practice, this requires manual tuning for each model configuration or hardware type.
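As a point of reference, the following is a minimal sketch (in Python with NumPy; all names are illustrative, not the API of any particular system) of the synchronisation step a parameter server performs in this conventional design: workers push gradients from their model replicas, the server averages them into the shared model, and workers pull the result.

    import numpy as np

    def parameter_server_round(weights, worker_grads, lr=0.01):
        """One synchronisation round in a conventional parameter-server
        design (illustrative sketch): gradients pushed by all worker
        replicas are averaged and applied to the central model copy."""
        avg_grad = np.mean(worker_grads, axis=0)   # aggregate pushed gradients
        return weights - lr * avg_grad             # single SGD step on the server

    # Hypothetical usage: 4 workers training replicas of a 10-parameter model.
    weights = np.zeros(10)
    grads = [np.random.randn(10) for _ in range(4)]
    weights = parameter_server_round(weights, grads)

Because this central copy sits on dedicated server nodes, every CPU cycle and byte of bandwidth given to the servers is taken away from the workers, which is the resource-splitting problem described above.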

Ako is a decentralised, dataflow-based DNN system that dispenses with parameter servers and is designed to saturate cluster resources. All nodes execute workers that fully use the CPU resources to update model replicas. To synchronise replicas as often as the available network bandwidth permits, workers exchange partitioned gradient updates directly with each other in a peer-to-peer fashion. The number of partitions is chosen so that the network bandwidth used remains constant, independently of cluster size. Since a worker receives a different gradient partition from each peer in every synchronisation round, partial gradient exchange maintains convergence: workers eventually receive the complete model gradient with bounded delay.
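The exchange can be made concrete with a short sketch, assuming a flat gradient vector per worker and synchronous rounds; the function names and the exact rotation rule below are illustrative assumptions, not Ako's actual protocol.

    import numpy as np

    def partition_gradient(grad, p):
        """Split a flat gradient vector into p roughly equal partitions."""
        return np.array_split(grad, p)

    def partial_exchange_round(all_grads, r, p):
        """One round of partial gradient exchange (illustrative sketch):
        worker i sends peer j one partition whose index rotates with the
        round number r, so over p consecutive rounds every peer receives
        every partition of i's gradient."""
        n = len(all_grads)
        received = [[] for _ in range(n)]          # messages per receiver
        for i in range(n):                         # sender
            parts = partition_gradient(all_grads[i], p)
            for j in range(n):                     # receiver
                if i == j:
                    continue
                k = (i + j + r) % p                # rotating partition index
                received[j].append((i, k, parts[k]))
        return received

    # Hypothetical usage: 4 workers, a 10-parameter model, 4 partitions.
    grads = [np.random.randn(10) for _ in range(4)]
    inbox = partial_exchange_round(grads, r=0, p=4)

Under such a scheme, each worker sends one partition of size |g|/p to each of its n-1 peers per round, so its outgoing traffic is roughly (n-1)|g|/p; choosing the number of partitions in proportion to the number of peers is what keeps per-worker bandwidth constant as the cluster grows, and any given partition reaches a given peer again after at most p rounds, which bounds the delay before the complete gradient arrives.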

Team
Pijika Watcharapichat (Microsoft Research Cambridge)
Victoria Lopez Morales (Sainsbury's, UK)

Related Publications

Ako: Decentralised Deep Learning with Partial Gradient Exchange
Pijika Watcharapichat, Victoria Lopez Morales, Raul Castro Fernandez, and Peter Pietzuch
ACM Symposium on Cloud Computing (SoCC 2016), Santa Clara, CA, USA