Shi Hongzhi, Zhao Jian, Zhao Yaqian, Li Ruyang, Wei Hui, Hu Kekun, Wen Dongchao, Jin Liang. Survey on System Optimization for Mixture of Experts in the Era of Large Models[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440016

Survey on System Optimization for Mixture of Experts in the Era of Large Models

Funds: This work was supported by the Shandong Provincial Natural Science Foundation (ZR2020QF035)
More Information
  • Author Bio:

    Shi Hongzhi: born in 1988. Master. Member of CCF. His main research interests include computer architecture and deep learning

    Zhao Jian: born in 1987. Master. His main research interests include computer architecture and deep learning

    Zhao Yaqian: born in 1981. PhD, senior engineer. Senior member of CCF. Her main research interests include computer architecture and artificial intelligence

    Li Ruyang: born in 1990. PhD, senior engineer. Senior member of CCF. Her main research interests include deep reinforcement learning, autonomous driving, and scenario-oriented AI acceleration

    Wei Hui: born in 1987. PhD, senior engineer. Member of CCF. His main research interests include virtual reality and 3D vision

    Hu Kekun: born in 1987. PhD, senior engineer. Member of CCF. His main research interests include graph computing and graph deep learning

    Wen Dongchao: born in 1979. Master, professor. Member of IEEE and CCF. His main research interests include computer vision and trustworthy deep learning

    Jin Liang: born in 1986. Master. His main research interests include computer vision and multimodal learning

  • Received Date: January 11, 2024
  • Revised Date: September 17, 2024
  • Accepted Date: October 14, 2024
  • Available Online: December 11, 2024
  • In recent years, large models have made unprecedented progress in a variety of domains, such as natural language processing and machine vision. Mixture of experts (MoE) has emerged as one of the most popular architectures for large models owing to its distinct advantages in parameter scalability, computational cost control, and complex task processing. However, as the parameter scale keeps growing, the execution efficiency and scalability of MoE systems increasingly struggle to keep up with demand, and this gap must be addressed urgently. System-level optimization is an effective way to close it and has become a hot research area. In light of this, we review the current state of research on MoE system optimization techniques in the era of large models. We first describe the development status of MoE large models and analyze the performance bottlenecks they face on the system side. We then systematically survey and analyze the most recent research progress along four core system dimensions, namely memory occupation, communication latency, computational efficiency, and parallel scaling, comparing and elaborating on the key technologies, application scenarios, and optimization directions of each. Finally, we summarize the current state of MoE system optimization and outline several future research directions.
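
    To make the object of these optimizations concrete, the sketch below shows a minimal sparsely gated MoE feed-forward layer with top-k token routing, in the style of the sparsely gated MoE layers used by Switch Transformer-like models. It is an illustrative example written for this summary, not code from any surveyed system; the class name SparseMoELayer and all hyperparameters (d_model, d_ff, num_experts, top_k) are assumptions chosen for clarity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SparseMoELayer(nn.Module):
        # One MoE block: a learned router (gate) plus num_experts independent FFN experts.
        def __init__(self, d_model: int, d_ff: int, num_experts: int = 8, top_k: int = 2):
            super().__init__()
            self.top_k = top_k
            self.gate = nn.Linear(d_model, num_experts, bias=False)   # router producing per-expert scores
            self.experts = nn.ModuleList([
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            ])

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (num_tokens, d_model). Each token activates only its top_k experts.
            logits = self.gate(x)                                      # (T, E) routing scores
            weights, expert_idx = torch.topk(logits, self.top_k, dim=-1)
            weights = F.softmax(weights, dim=-1)                       # renormalize over the chosen experts
            out = torch.zeros_like(x)
            for e, expert in enumerate(self.experts):
                token_idx, slot = (expert_idx == e).nonzero(as_tuple=True)
                if token_idx.numel() == 0:                             # this expert received no tokens
                    continue
                out[token_idx] += weights[token_idx, slot].unsqueeze(-1) * expert(x[token_idx])
            return out

    # Usage: 16 tokens of width 512 pass through a layer with 8 experts, but each token runs
    # only 2 of them, which is how MoE grows parameter count while holding per-token compute roughly constant.
    layer = SparseMoELayer(d_model=512, d_ff=2048)
    print(layer(torch.randn(16, 512)).shape)                           # torch.Size([16, 512])

    Even in this toy form, the system costs discussed above are visible: every expert's weights must stay resident in memory although only a few run per token, and once experts are sharded across devices the routing step turns into all-to-all communication whose load balance largely determines computational efficiency and parallel scalability.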

