After more than three years of sustained effort, the HiPress system, developed by our ADSL laboratory in collaboration with the IDS laboratory led by Prof. Feng Yan of the University of Nevada (recipient of an NSF CAREER Award this year) and Dr. Ruichuan Chen of Bell Labs in Germany, has been accepted by the 28th ACM Symposium on Operating Systems Principles (SOSP), a premier international conference in computer systems. Congratulations to all the faculty, students, and collaborators who took part in this research.
This research targets the scalability of large-scale data-parallel training of deep neural network (DNN) models. Using optimizations including automated generation of efficient on-GPU gradient compression algorithms, a compression-aware gradient aggregation protocol, and cost-model-driven selective gradient compression and partitioning, we built a high-performance data-parallel training framework that supports a wide range of gradient compression algorithms and is compatible with mainstream deep learning systems such as MXNet, TensorFlow, and PyTorch. On DNN models widely used in industry, it delivers a clear training-speed improvement over state-of-the-art systems and shows strong potential for practical deployment.
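To make the idea of cost-model-driven selective compression concrete, here is a minimal Python sketch: a gradient tensor is compressed before synchronization only when the estimated encode + compressed-transfer + decode time beats sending it raw. The function names (`transfer_time`, `should_compress`), throughput numbers, fixed overhead, and compression ratio are all illustrative assumptions, not HiPress's actual cost model.

```python
# Hypothetical sketch of a selective-compression decision based on a simple cost model.
# All constants below are assumptions for illustration only.

def transfer_time(num_bytes: float, bandwidth_gbps: float = 100.0) -> float:
    """Estimated wire time (seconds) to send `num_bytes` over the network."""
    return num_bytes * 8 / (bandwidth_gbps * 1e9)

def should_compress(grad_bytes: float,
                    compression_ratio: float = 0.01,   # assumed compressed/original size
                    codec_gbps: float = 400.0,         # assumed on-GPU codec throughput
                    fixed_overhead_s: float = 2e-4     # assumed kernel-launch/coordination cost
                    ) -> bool:
    """Compress only if encode + compressed transfer + decode beats raw transfer.

    Small tensors typically fail this test (the fixed overhead dominates)
    and are therefore sent uncompressed.
    """
    raw_cost = transfer_time(grad_bytes)
    compressed_cost = (fixed_overhead_s
                       + grad_bytes * 8 / (codec_gbps * 1e9)                        # encode
                       + transfer_time(grad_bytes * compression_ratio)              # wire
                       + grad_bytes * compression_ratio * 8 / (codec_gbps * 1e9))   # decode
    return compressed_cost < raw_cost

if __name__ == "__main__":
    for size_mb in (0.25, 4, 64, 256):
        print(f"{size_mb:>7} MB -> compress: {should_compress(size_mb * 1e6)}")
```

Under these assumed numbers, the smallest tensor is sent raw while the larger ones are compressed, which is the intuition behind compressing gradients selectively rather than uniformly.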
SOSP and OSDI are regarded as the top academic conferences in computer systems; SOSP is held every two years and has a 54-year history. This year's SOSP received 348 submissions and accepted 54 papers, an acceptance rate of 15.5%, making the competition extremely intense. The conference will take place from October 25 to 28, and PhD student Youhui Bai of our laboratory will present the work to peers at home and abroad on behalf of the participating faculty and students.
In recent years, under the leadership of the laboratory director, Prof. Yinlong Xu, and through the joint efforts of research professor Cheng Li, associate professors Yongkun Li and Min Lv, associate research fellow Si Wu, and many students in the laboratory, ADSL has built deep expertise and a long-term, steady body of work in parallel and distributed machine learning systems. The publication of this work marks international recognition of that expertise and accumulation, and a step up in the laboratory's research standing.
This work was funded by the National Natural Science Foundation of China (a Key Program project and a Young Scientists Fund project), a Key R&D project of the Ministry of Science and Technology, the Double First-Class discipline construction initiative, the 111 Project for collaborative innovation, and the Hefei innovation program for returned overseas scholars. Platform support was provided by the National High Performance Computing Center (Hefei), the Anhui Provincial Key Laboratory of High Performance Computing, the supercomputing center, and Amazon Web Services.
Paper title: Gradient Compression Supercharged High-Performance Data Parallel DNN Training
Abstract: Gradient compression is a promising approach to alleviating the communication bottleneck in data parallel deep neural network (DNN) training by significantly reducing the data volume of gradients for synchronization. While gradient compression is being actively adopted by the industry (e.g., Facebook and AWS), our study reveals that there are two critical but often overlooked challenges: 1) inefficient coordination between compression and communication during gradient synchronization incurs substantial overheads, and 2) developing, optimizing, and integrating gradient compression algorithms into DNN systems imposes heavy burdens on DNN practitioners, and ad-hoc compression implementations often yield surprisingly poor system performance.
In this paper, we first propose a compression-aware gradient synchronization architecture, CaSync, which relies on a flexible composition of basic computing and communication primitives. It is general and compatible with any gradient compression algorithms and gradient synchronization strategies, and enables high-performance computation-communication pipelining. We further introduce a gradient compression toolkit, CompLL, to enable efficient development and automated integration of on-GPU compression algorithms into DNN systems with little programming burden. Lastly, we build a compression-aware DNN training framework, HiPress, with CaSync and CompLL. HiPress is open-sourced and runs on mainstream DNN systems such as MXNet, TensorFlow, and PyTorch. Evaluation in a 16-node cluster with 128 NVIDIA V100 GPUs and a 100Gbps network shows that HiPress improves the training speed by up to 106.4% over the state-of-the-art across six popular DNN models.
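For readers unfamiliar with the computation-communication pipelining mentioned in the abstract, the sketch below illustrates the general idea in plain Python: a gradient is split into chunks, and compression of the next chunk overlaps with transmission of the previous one. The chunking granularity, the top-k placeholder codec, and the `send` stub are assumptions made for illustration and do not reflect CaSync's actual primitives or implementation.

```python
# Hypothetical sketch of chunk-wise compute-communication overlap for gradient synchronization.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def compress(chunk: np.ndarray) -> np.ndarray:
    """Placeholder codec: keep the top 1% of entries by magnitude (indices + values)."""
    k = max(1, chunk.size // 100)
    idx = np.argpartition(np.abs(chunk), -k)[-k:]
    return np.stack([idx.astype(np.float32), chunk[idx]])

def send(payload: np.ndarray) -> None:
    """Placeholder for the communication primitive (e.g., exchanging sparse chunks with peers)."""
    pass  # network I/O would happen here

def pipelined_sync(grad: np.ndarray, num_chunks: int = 8) -> None:
    """Compress chunk i+1 while chunk i is still being transmitted."""
    chunks = np.array_split(grad, num_chunks)
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending_send = None
        for chunk in chunks:
            payload = pool.submit(compress, chunk).result()  # compute stage; overlaps prior send
            if pending_send is not None:
                pending_send.result()                        # wait for the previous transfer
            pending_send = pool.submit(send, payload)        # communication stage
        if pending_send is not None:
            pending_send.result()

if __name__ == "__main__":
    pipelined_sync(np.random.randn(1 << 20).astype(np.float32))
```

The point of the overlap is that, with reasonably sized chunks, the encoder and the network are both kept busy instead of alternating, which is what hides much of the compression cost behind communication.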