
    Sample Dispatching Mechanism for Accelerating Recommendation Model Training in Edge Intelligent Computing Systems

    • Abstract: Training deep learning recommendation models (DLRMs) on edge workers in an edge intelligent computing system brings several benefits, particularly in terms of data privacy protection, low latency, and personalized recommendation. However, because embedding tables are huge, typical DLRM training frameworks adopt one or more parameter servers to maintain the global embedding tables, while several edge workers each cache a portion of them. Under this architecture, embeddings must be transmitted between edge workers and parameter servers to keep the embedding data consistent, and this transmission cost usually dominates the training cycle. This paper investigates how to dispatch input embedding samples to appropriate edge workers so as to minimize the total embedding transmission cost under edge-specific challenges such as heterogeneous networks and limited resources. To this end, we propose ESD, an embedding sample dispatching mechanism based on expected embedding transmission cost. Within ESD, we design HybridDis, a dispatch decision method that combines a resource-intensive optimal algorithm with a heuristic algorithm to balance decision quality against resource consumption. We implement a prototype of ESD in C++ and Python and compare it with state-of-the-art mechanisms on real-world workloads. Extensive experimental results show that ESD reduces the embedding transmission cost by up to 36.76% and achieves up to 1.74x speedup in end-to-end DLRM training.
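
    To make the mechanism concrete, below is a minimal sketch of cost-aware sample dispatching in Python (one of the two languages the ESD prototype uses). This is not the paper's HybridDis method: the greedy policy and all names (expected_cost, dispatch, link_costs, capacity) are illustrative assumptions. It only captures the core idea stated in the abstract, namely that each input sample should be dispatched to the worker whose expected embedding transmission cost is lowest.

        # Illustrative sketch only; not the paper's HybridDis algorithm.
        from typing import Dict, List, Set

        def expected_cost(sample: Set[int], cache: Set[int], link_cost: float) -> float:
            """Expected transmission cost of training `sample` on one worker:
            every embedding ID absent from the worker's local cache must be
            fetched from the parameter server over that worker's link."""
            return len(sample - cache) * link_cost

        def dispatch(samples: List[Set[int]],
                     caches: Dict[str, Set[int]],
                     link_costs: Dict[str, float],
                     capacity: Dict[str, int]) -> Dict[str, List[Set[int]]]:
            """Greedily assign each sample (a set of embedding IDs) to the
            feasible worker with the lowest expected transmission cost,
            respecting a per-worker sample budget. Assumes total capacity
            is sufficient for the batch."""
            plan: Dict[str, List[Set[int]]] = {w: [] for w in caches}
            load = {w: 0 for w in caches}
            for s in samples:
                feasible = [w for w in caches if load[w] < capacity[w]]
                best = min(feasible,
                           key=lambda w: expected_cost(s, caches[w], link_costs[w]))
                plan[best].append(s)
                load[best] += 1
                caches[best] |= s  # fetched embeddings are now cached locally
            return plan

        if __name__ == "__main__":
            caches = {"w1": {1, 2, 3}, "w2": {4, 5}}
            link_costs = {"w1": 1.0, "w2": 2.5}  # heterogeneous edge links
            capacity = {"w1": 2, "w2": 2}
            batch = [{1, 2}, {4, 6}, {1, 5}]
            print(dispatch(batch, caches, link_costs, capacity))

    Per the abstract, HybridDis goes further than such a heuristic alone: it combines a resource-intensive optimal algorithm with a heuristic one, trading decision quality against resource consumption on resource-limited edge nodes.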

       
