Seminars

Building Four-dimensional Parallel Training Systems for Large AI Models

Heming Cui, University of Hong Kong

Abstract

The increasing modeling capacities of large DNNs (e.g., Transformer and GPT-3) have led to unprecedented successes in various AI areas, including vision and natural language understanding. The high modeling power of a large DNN mainly stems from its increasing complexity (having more neuron layers and more neuron operators in each layer) and dynamicity (frequently activating/deactivating neuron operators in each layer during training, as in Neural Architecture Search, or NAS). Such complexity and dynamicity can easily make a large DNN exceed the computing and memory capacities of a modern GPU, so training a large DNN often requires splitting it across many GPUs along multiple dimensions, including data parallelism, tensor parallelism, and pipeline parallelism. Dr. Cui’s talk will present his two recent papers, [vPipe TPDS 2021] and [NASPipe ASPLOS 2022], which address major limitations in existing multi-dimensional parallel training systems, including GPipe, PipeDream, and Megatron. vPipe addresses the severe load imbalance and low GPU computing utilization (e.g., merely 20% in some of the latest systems); NASPipe presents Supernet parallelism, a new parallel training dimension for highly dynamic large DNNs designed in the Supernet and NAS style (e.g., Evolved Transformer).

Please email for a Zoom link

About the speaker

Dr. Heming Cui (cs.hku.hk/people/academic-staff/heming) is an Associate Professor in HKU CS. He leads a large research group of about 15 current PhD students at HKU, all of whom he supervises. Dr. Cui is interested in building software infrastructures and tools to greatly improve the reliability, security, and performance of real-world software. His recent research has led to a series of open source projects and publications in top international conferences and journals across broad areas, including SOSP, NSDI, ASPLOS, ATC, ICSE, EuroSys, TPDS, and TDSC. Over the past three years, Dr. Cui has served on the program committees of top international systems/networking conferences, including NSDI, ATC, EuroSys, DSN, SoCC, and ICDCS. Dr. Cui has received several internationally competitive research awards, including a Croucher Innovation Award in 2016 and a best paper award from ACSAC 2017. Dr. Cui's secure system papers (e.g., [Uranus AsiaCCS 2020] and [Sotter ATC 2022]) on Trusted Execution Environments have become the core commercial system of Huawei's Trusted and Intelligent Cloud Services (https://www.huaweicloud.com/product/tics.html). Dr. Cui received his bachelor's and master's degrees from Tsinghua University and his PhD from Columbia University, all in Computer Science.

Date & Time

Thursday, May 19, 2022 - 14:00

Location

Online