KungFu: Adaptive Distributed Machine Learning

KungFu is a novel distributed machine learning framework for TensorFlow.

Today's machine learning systems must cope with increasingly complex models and deployment environments, which makes it difficult for them to deliver consistently high performance using a single, empirically chosen configuration. To address this, KungFu enables machine learning users to realise adaptive distributed training policies through high-level training monitoring and control APIs. KungFu has a fast and scalable runtime that can automatically scale policy execution out onto distributed GPU servers. Large-scale cluster experiments show that KungFu not only enables real-world adaptive training use cases, but also outperforms state-of-the-art distributed training systems, including Horovod and Parameter Servers.

KungFu is open-sourced at: https://github.com/lsds/KungFu.
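
As a brief sketch of how a TensorFlow training script is usually adapted for KungFu, the snippet below wraps a standard Keras optimizer with KungFu's synchronous SGD optimizer and broadcasts the initial model state from worker 0. The module paths (kungfu.tensorflow.optimizers.SynchronousSGDOptimizer, kungfu.tensorflow.initializer.BroadcastGlobalVariablesCallback) follow the examples in the repository above, which should be treated as the authoritative reference for the current API.

# Minimal sketch of data-parallel training with KungFu and Keras.
# Module paths follow the examples in the KungFu repository
# (https://github.com/lsds/KungFu); check the repo for the current API.
import tensorflow as tf
from kungfu.tensorflow.optimizers import SynchronousSGDOptimizer
from kungfu.tensorflow.initializer import BroadcastGlobalVariablesCallback

# Build a toy model; any Keras model is handled the same way.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(10, activation='softmax'),
])

# Wrap a standard optimizer so that gradients are synchronised
# across all KungFu workers after every training step.
optimizer = SynchronousSGDOptimizer(tf.keras.optimizers.SGD(0.01))

model.compile(optimizer=optimizer,
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Broadcast the initial weights from worker 0 so that all
# workers start training from the same model state.
callbacks = [BroadcastGlobalVariablesCallback()]

# x_train / y_train stand in for the user's own dataset:
# model.fit(x_train, y_train, epochs=1, callbacks=callbacks)

The script is then launched on multiple GPUs with the kungfu-run launcher shipped with the repository, for example: kungfu-run -np 4 python train.py.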

Related Publications

KungFu: Making Training in Distributed Machine Learning Adaptive
Luo Mai, Guo Li, Marcel Wagenländer, Konstantinos Fertakis, Andrei-Octavian Brabete, and Peter Pietzuch
14th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2020

Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources
Marcel Wagenländer, Luo Mai, Guo Li, and Peter Pietzuch
12th USENIX Workshop on Hot Topics in Cloud Computing (HotCloud), 2020
Boston, MA, USA

Taming Hyper-parameters in Deep Learning Systems
Luo Mai, Alexandros Koliousis, Guo Li, Andrei-Octavian Brabete, and Peter Pietzuch
ACM SIGOPS Operating Systems Review, 2019
Volume 53, Issue 1