On October 18, 2022 (Beijing time), the 29th IEEE International Symposium on High-Performance Computer Architecture (HPCA) announced its paper acceptance results. HPCA accepted 91 of 364 submissions this year, an acceptance rate of 25.0%. Our laboratory's paper "MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism" was among those accepted. The first author is PhD student 周泉; the other student authors are 王海权 (PhD student) and 于笑颜 (Master's student); the work was advised by Research Professor 李诚, Professor 许胤龙, and Professor 颜枫. Congratulations to all of them!
A large body of research and production experience shows that the larger a deep neural network model is, the higher its accuracy tends to be. Training such large models, however, runs into a severe memory wall. State-of-the-art systems such as ZeRO and Megatron reduce memory consumption through techniques such as memory deduplication and 3D parallelism, yet they still struggle to balance memory savings against computational efficiency and communication overhead. The memory-capacity bottleneck of large-model training has become a serious constraint on AI applications, and overcoming it is now a contested technical high ground for leading Internet companies and advanced research teams at home and abroad.
In this paper, we make the novel observation that, for training large Transformer models, pipeline (inter-operator) parallelism naturally offers lower memory redundancy, less cross-GPU communication, and higher compute density than data-parallel approaches such as ZeRO, making it particularly well suited to training very large models on small-scale hardware. Building on this, we design a novel device-to-device (D2D) swapping technique that exchanges model data between GPUs over high-speed NVLink interconnects, so that the idle memory of GPUs at the tail of the pipeline absorbs the memory pressure of the GPUs at its head. This balances GPU memory usage while increasing the size of models that can be trained. MPress further integrates state-of-the-art techniques such as activation recomputation and heterogeneous (CPU-GPU) memory to compress GPU memory consumption even more. Experimental results show that, when training models of the same scale on single DGX-1 and DGX-2 servers, MPress improves training performance over the ZeRO series by 1.7x to 2.3x. A distributed version of MPress has also been completed and reaches an internationally leading level in trainable model size and per-device compute efficiency.
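To make the D2D swapping idea concrete, below is a minimal, hypothetical PyTorch sketch, not MPress's actual implementation: an activation held on a heavily loaded GPU at the head of the pipeline is copied over NVLink to a lightly loaded GPU at the tail, then brought back before the backward pass needs it. The device indices and the helper names swap_out/swap_in are illustrative assumptions.

```python
# Minimal sketch of device-to-device (D2D) activation swapping in PyTorch.
# This is NOT MPress's actual code; device indices and the helper names
# swap_out/swap_in are hypothetical, for illustration only.
import torch

copy_stream = torch.cuda.Stream(device=0)  # side stream so copies can overlap compute

def swap_out(tensor, spare_device):
    """Copy a tensor from a heavily loaded GPU to a lightly loaded peer GPU."""
    with torch.cuda.stream(copy_stream):
        # With peer access enabled, .to() between GPUs is a direct P2P copy over NVLink.
        remote = tensor.to(spare_device, non_blocking=True)
    copy_stream.synchronize()  # a real system would overlap this via CUDA events
    return remote

def swap_in(remote, original_device):
    """Bring a swapped-out tensor back before the backward pass needs it."""
    with torch.cuda.stream(copy_stream):
        local = remote.to(original_device, non_blocking=True)
    copy_stream.synchronize()
    return local

# GPU 0 hosts an early (memory-hungry) pipeline stage; GPU 7 hosts a late, lightly loaded one.
act = torch.randn(4096, 4096, device="cuda:0")   # an activation produced during forward
remote_act = swap_out(act, "cuda:7")             # offload to the tail GPU's spare memory
del act                                          # free memory on the head GPU
# ... run other forward/backward work ...
act = swap_in(remote_act, "cuda:0")              # restore just in time for backward
```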
MPress is our laboratory's latest effort in accelerating large-scale deep neural network training, following PaGraph (SoCC 2020, TPDS 2021) and HiPress (SOSP 2021). HiPress tackles the communication wall of large-scale data-parallel training, while MPress targets the memory wall of pipeline-parallel training; the two complement each other and form a "Press series." Group members have been invited to present these results at academic venues such as SOSP, GNNSys, SoCC, and ChinaSys, as well as at industry forums including Huawei STW and Alibaba's 洞见 forum, and the work has attracted attention from industry. The series aims to fully unlock the potential of existing AI accelerators and lower the cost of model training, buying time for the development of high-end chips and creating favorable conditions for countering high-tech blockades and achieving the "dual carbon" goals.
This work was jointly funded by a Key Program of the National Natural Science Foundation of China, the Ministry of Education's 111 Project, an Anhui Provincial university collaborative innovation project, Double First-Class special funds, an open project of the Science and Technology on Parallel and Distributed Processing Laboratory, and innovation projects from OPPO, Alibaba, and Huawei, with additional support from the National High Performance Computing Center (Hefei), the Anhui Province Key Laboratory of High Performance Computing, the Amazon Web Services cloud platform, and 上海即算科技有限公司.
Paper title: MPress: Democratizing Billion-Scale Model Training on Multi-GPU Servers via Memory-Saving Inter-Operator Parallelism
Abstract:
It remains challenging to train billion-scale DNN models on a single modern multi-GPU server due to the GPU memory wall.
Unfortunately, existing memory-saving techniques such as GPU-CPU swap, recomputation, and ZeRO-Series come at the price of extra computation, communication overhead, or limited memory reduction.
We present MPress, a new single-server multi-GPU system that breaks the GPU memory wall of billion-scale model training while minimizing extra cost. MPress first discusses the trade-offs of various memory-saving techniques and offers a holistic solution, which instead chooses inter-operator parallelism, with its low cross-GPU communication traffic, and combines it with recomputation and swapping to balance training performance and the sustained model size. Additionally, MPress employs a novel, fast D2D swap technique that simultaneously uses multiple high-bandwidth NVLink links to swap tensors to lightly loaded GPUs, based on the key observation that inter-operator parallel training may result in imbalanced GPU memory utilization, so the spare memory of the least-loaded devices, together with the high-end interconnects among them, can support low-overhead swapping. Finally, we integrate MPress with PipeDream and DAPPLE, two representative inter-operator parallel training systems. Experimental results with two popular DNN models, BERT and GPT, on two modern GPU servers of the DGX-1 and DGX-2 generations, equipped with 8 V100 or A100 cards, respectively, demonstrate that MPress significantly improves training throughput over ZeRO-Series under identical memory reduction, while being able to train larger models than the recomputation baseline.
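As a complement to the abstract's description of swapping to "light-load" GPUs, the following hypothetical PyTorch snippet sketches one way a swap target could be chosen by available memory; the candidate device list and the function pick_swap_target are assumptions for illustration, not taken from the paper.

```python
# Hypothetical sketch: pick the NVLink peer with the most free memory as the swap target.
import torch

def pick_swap_target(candidate_devices):
    """Return the candidate GPU that currently has the most free memory."""
    best_dev, best_free = None, 0
    for dev in candidate_devices:
        free_bytes, _total_bytes = torch.cuda.mem_get_info(dev)  # (free, total) in bytes
        if free_bytes > best_free:
            best_dev, best_free = dev, free_bytes
    return best_dev

# Example: assume GPUs 4-7 host the tail pipeline stages and are lightly loaded.
target = pick_swap_target([f"cuda:{i}" for i in range(4, 8)])
```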