  • 中国精品科技期刊
  • CCF推荐A类中文期刊
  • 计算领域高质量科技期刊T1类

边缘智能计算系统中加速推荐模型训练的样本调度机制

李国鹏, 谈海生, 张弛, 倪宏秋, 王子龙, 章馨月, 徐洋, 田晗, 陈国良

李国鹏, 谈海生, 张弛, 倪宏秋, 王子龙, 章馨月, 徐洋, 田晗, 陈国良. 边缘智能计算系统中加速推荐模型训练的样本调度机制[J]. 计算机研究与发展. DOI: 10.7544/issn1000-1239.202550128
Li Guopeng, Tan Haisheng, Zhang Chi, Ni Hongqiu, Wang Zilong, Zhang Xinyue, Xu Yang, Tian Han, Chen Guoliang. Samples Dispatching Mechanism for Accelerating Recommendation Model Training in Edge Intelligent Computing System[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202550128
李国鹏, 谈海生, 张弛, 倪宏秋, 王子龙, 章馨月, 徐洋, 田晗, 陈国良. 边缘智能计算系统中加速推荐模型训练的样本调度机制[J]. 计算机研究与发展. CSTR: 32373.14.issn1000-1239.202550128
Li Guopeng, Tan Haisheng, Zhang Chi, Ni Hongqiu, Wang Zilong, Zhang Xinyue, Xu Yang, Tian Han, Chen Guoliang. Samples Dispatching Mechanism for Accelerating Recommendation Model Training in Edge Intelligent Computing System[J]. Journal of Computer Research and Development. CSTR: 32373.14.issn1000-1239.202550128

边缘智能计算系统中加速推荐模型训练的样本调度机制

基金项目: 

国家自然科学基金重点项目(62132009)


详细信息
    作者简介:

    李国鹏: 1997年生. 博士研究生. 主要研究方向为边缘智能、基于大模型的智能体、机器学习系统

    谈海生: 1981年生. 博士,教授. CCF会员. 主要研究方向为边缘智能、人工智能系统与网络

    张弛: 1995年生. 博士,副教授. 主要研究方向为边缘计算、网络算法

    倪宏秋: 2000年生. 博士研究生. 主要研究方向为边缘计算、大语言模型推理、机器学习系统

    王子龙: 2000年生. 硕士研究生. 主要研究方向为边缘计算、调度机制、机器学习系统

    章馨月: 2000年生. 博士研究生. 主要研究方向为边缘计算、服务器无感知计算、机器学习系统

    徐洋: 2003年生. 硕士研究生. 主要研究方向为边缘计算、机器学习系统、大语言模型

    田晗: 1989年生. 博士,副研究员. 主要研究方向为机器学习及其在网络、系统、隐私计算中的应用

    陈国良: 1938年生. 教授. CCF会士. 主要研究方向为并行算法、计算机体系结构、计算智能

    通讯作者:

    谈海生(hstan@ustc.edu.cn)

  • 中图分类号: TP303;TP393

Samples Dispatching Mechanism for Accelerating Recommendation Model Training in Edge Intelligent Computing System

Funds: 


This work is supported by the Key Program of the National Natural Science Foundation of China (62132009).

More Information
    Author Bio:

    Li Guopeng: born in 1997. PhD candidate. His main research interests include edge intelligence, large language model-based agent, and machine learning system

    Tan Haisheng: born in 1981. PhD, professor. Member of CCF. His main research interests include edge intelligence, and system and networking for AI

    Zhang Chi: born in 1995. PhD, associate professor. His main research interests include edge computing and network algorithms

    Ni Hongqiu: born in 2000. PhD candidate. Her main research interests include edge computing, large language model inference, and machine learning system

    Wang Zilong: born in 2000. Master candidate. His main research interests include edge computing, scheduling mechanism, and machine learning system

    Zhang Xinyue: born in 2000. PhD candidate. Her main research interests include edge computing, serverless computing, and machine learning system

    Xu Yang: born in 2003. Master candidate. His main research interests include edge computing, machine learning system, and large language model

    Tian Han: born in 1989. PhD, associate professor. His main research interests include machine learning and its applications in networking, systems, and privacy computing

    Chen Guoliang: born in 1938. Professor. Fellow of CCF. His main research interests include parallel algorithms, computer architectures, and computational intelligence

  • 摘要:

    在边缘智能计算系统中使用边缘工作节点训练深度学习推荐模型(deep learning recommendation model,DLRM)具有诸多优势,尤其是在数据隐私保护、低延迟和个性化推荐等方面. 然而,由于嵌入表的规模庞大,在训练DLRM时通常采用1个或多个参数服务器来维护全局嵌入表,同时利用多个边缘节点缓存嵌入表的一部分. 在此架构下,需要在边缘节点和参数服务器间传输嵌入以保证嵌入数据一致性,嵌入传输代价通常主导了训练周期. 本文旨在研究在边缘智能计算系统中,当面对异构网络和资源受限等挑战时,如何将嵌入样本调度到合适的边缘节点上进行训练,以最小化总嵌入传输代价. 为此,提出了一个基于预期嵌入传输代价的嵌入样本调度机制ESD. 在ESD中,设计了一个结合资源密集型最优解法和启发式解法的调度决策方法HybridDis,以实现决策质量和资源消耗之间的平衡. 使用C++和Python实现了ESD的原型系统,并在真实工作负载下将其与现有最先进的机制进行比较. 大量实验结果表明,ESD可将嵌入传输代价至多降低36.76%,并且在端到端DLRM训练速度上实现了最高1.74倍的加速.

    Abstract:

    Using edge workers to train deep learning recommendation model (DLRM) in edge intelligent computing system brings several benefits, particularly in terms of data privacy protection, low latency and personalization. However, due to the huge size of embedding tables, typical DLRM training frameworks adopt one or more parameter servers to maintain global embedding tables, while leveraging several edge workers to cache part of them. This incurs significant transmission cost for embedding transmissions between workers and parameter servers, which can dominate the training cycle. In this paper, we investigate how to dispatch input embedding samples to appropriate edge workers to minimize the total embedding transmission cost when facing edge-specific challenges such as heterogeneous networks and limited resources. We develop ESD, a novel mechanism that optimizes the dispatching of input embedding samples to edge workers based on expected embedding transmission cost. We propose HybridDis as the dispatch decision method within ESD, which combines a resource-intensive optimal algorithm and a heuristic algorithm to balance decision quality and resource consumption. We implement a prototype of ESD using C++ and Python and compare it with state-of-the-art mechanisms on real-world workloads. Extensive experimental results show that ESD reduces the embedding transmission cost by up to 36.76% and achieves up to 1.74 times speedup in end-to-end DLRM training.

    语音编解码是移动通信、互联网通信等众多领域的重要技术之一[1-5]. 语音信号在传输过程中经历了发送端的信号处理和特征提取,随后经数据压缩后传输至接收端,最终接收端通过解码器将恢复的特征解码重建成语音波形. 这构成了典型的语音编解码系统,包括编码器、量化器和解码器3个模块. 传统的语音编解码器大多基于数字信号处理方法,针对不同适用条件结合一些专家知识精心设计和选择构建,例如利用心理声学和语音合成等领域的知识来提高编码效率等[6-10]. 然而,这些方法不仅适用性受到限制,其生成的语音质量也有限[11]. 以神经网络为代表的机器学习方法最初仅被应用于语音降噪等编解码的后处理阶段[12-13]. 随着“数据驱动”模式下深度学习技术的进步,这些方法从辅助优化逐渐转变为编解码器本身的核心组件之一,不仅便于设计,而且展现了出色的性能,在不同网络带宽条件下均能解码得到较高质量的重建语音[14-21].

    由于解码过程与语音合成领域的声码器同属于波形的生成过程,因此,早期的工作尝试直接用神经声码器模型实现语音解码[22-23]. 2017年,Kleijn 等人[17]使用 WaveNet[24]作为语音生成模型,利用其自回归生成能力,显著提高了解码语音的质量. 2018年,Srihari[25]实现了神经语音编解码系统的端到端优化,该模型无需手动特征工程,全面优化宽带语音编码管道中的各个步骤(包括压缩、量化和解压缩),大幅提升了系统的适应性. 2019年,Gârbacea等人[22]同时采用基于矢量量化的变分自编码器VQ-VAE[26]和 WaveNet 解码器的神经网络架构进行语音编解码. VQ-VAE通过将编码特征离散化完成数据压缩过程,提升了量化器的性能. 2021年,Kleijn 等人推出了 Lyra,对经过KLT变换(Karhunen–Loève transform)的梅尔谱进行矢量量化,并采用WaveGRU作为解码器,实现了高效的波形恢复. Zeghidour等人[11]开发的 SoundStream 利用生成对抗网络(generative adversarial network,GAN)的模式进行对抗性训练,并引入残差矢量量化的技巧,使单个模型能够处理不同的比特率,从而适用于各种网络带宽;此外,它还是一种对波形采用全卷积结构的端到端编解码系统. 2022年,Défossez等人[27]开发的Encodec在SoundStream基础上更进一步,引入了轻量级的语言模型、熵编码等技术进行优化,且可以通过分别处理左右声道来压缩立体声音频. 而Ratnarajah等人[28]则专为高效压缩多声道语音而提出M3-AUDIODEC,展示了神经编解码器在多声道音频编码上的显著提升. 最近,2023年,Yang等人[29]在SoundStream的基础上提出Hificodec,采用分组残差矢量量化方法,进一步提升了重建语音质量. 上述语音神经编解码方法普遍采用卷积编码器直接从波形中学习特征,获得了很好的重建语音质量,但卷积编码器通过调节卷积单元中的下采样倍数完成时域帧的逐步压缩,在提取优秀潜在特征的同时,以一定的卷积计算量为代价.

    梅尔谱特征作为声学领域的经典手工特征,符合人耳听觉的感知特性[30-31]. 尽管过去的一些编解码器方法[23]曾利用梅尔谱图作为编码器输入,但其缺乏对量化方法的优化[32],相应的编解码设计使其重建音频质量相较SoundStream等波形方法有所不足.

    为了在保持重建语音质量的同时降低语音编码计算开销,本文提出了基于梅尔谱与压缩激励加权量化的语音神经编解码方法. 由于梅尔谱的提取过程本身便伴随着时域帧的逐步压缩,因此本文对梅尔谱运用卷积编码器进行特征提取,所需卷积层数更少,计算规模更小,从而减少时延,以平衡更加多样的用户需求和更为稳定的用户体验. 对于量化器,由于卷积编码器各个输出通道的信息量具有差异[33],该不均匀性将影响矢量量化过程中各个通道维度的重要程度. 因此,本文借鉴压缩激励网络(squeeze-and-excitation networks,SENet)的思想[26],提取编码器最后一层各维度的激励权重,将其作为量化器中计算码本距离时各维度的权重系数,即通过编码器的自适应学习捕捉特征之间的相互依赖性,减少冗余信息[26],从而确定对量化器更重要的通道,提升量化性能.

    本文在LibriTTS[34]和VCTK[35]数据集上进行实验,结果表明:使用梅尔谱图作为输入特征,结合低层卷积编码器,并采用压缩激励的方法利用梅尔谱图经过低层编码器后输出特征信息量的不均匀性,不仅普遍降低了时延,而且能在较低比特率环境(3 Kbps及以下)中提升感知质量. 以感知质量较好的基线Hificodec[29]为基准,比特率为1.5 Kbps时,该方法编码计算的实时率(real time factor,RTF)最多可提升至4.6倍. 此外,较低比特率的情况下,本方法的感知质量超越所有基准模型. 特别是在0.75 Kbps时,与最佳的Encodec[27]相比,短时客观可懂度(short-time objective intelligibility,STOI)[36]和虚拟语音质量客观评估(virtual speech quality objective listener,VISQOL)[37]的平均提升率为8.72%.

    本文探究了不同比特率下各压缩率与码本数目组合的消融实验. 发现比特率提高时,增加压缩率不利于模型的训练效果和性能,应当同时兼顾码本数目进行参数选择. 此外,通过对不同比特率和不同输入特征进行压缩激励权重的消融实验,本文探索了压缩激励权重方法的性质:其优化效果与比特率呈反相关,且相较波形特征,该方法更适用于梅尔谱编解码器. 最后,本文还对神经解码器网络中的激活函数进行了消融实验,通过比较Relu激活函数和具有周期特性的Snake激活函数[38-39],研究结果表明:在保持语音质量相当的情况下,Relu激活函数能显著提高运行速度. 特别是在低比特率环境(3 Kbps)下,Relu函数在保持语音质量几乎不变的同时,实时率(RTF)均能提升2倍以上,因此更适合实际需求.

    本节从梅尔谱图、残差矢量量化、压缩激励网络3个方面介绍相关工作.

    梅尔谱图作为一种语音处理中常用的前端特征,其原理根植于人耳对频率感知的非线性特性,特别是对低频信号更为敏感. 为了模拟这种听觉特性,人们引入了梅尔标度,这种非线性对数变换针对频率标度进行转换,将语音语谱图的频率维度应用梅尔标度即为梅尔谱图[30,40]. 该转换过程涉及语谱图与多个梅尔滤波器相乘,而语音的语谱图则源自对语音序列进行短时傅里叶变换并提取幅度谱. 在短时傅里叶变换中,帧移操作对时域信号进行压缩,因此梅尔谱图的时域帧数远远少于原始波形数据点,为后续的编解码过程提供了便利.
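帧移对时域帧数的压缩作用可以用一段纯Python代码粗略示意(最小示意,忽略了加窗与边界填充的细节;采样率与帧移取文中实验的设置):

```python
def num_frames(num_samples: int, hop_size: int) -> int:
    # 短时傅里叶变换后的时域帧数约为采样点数除以帧移(忽略加窗与填充带来的边界差异)
    return num_samples // hop_size

sr = 24000        # 采样率,与文中实验一致
hop_size = 160    # 文中主实验采用的帧移
print(num_frames(sr, hop_size))  # 1 s语音(24 000个采样点)约压缩为150个时域帧
```

可见梅尔谱图的时域帧数仅为原始波形数据点的1/hop_size,这正是后续低层卷积编码器得以减少下采样层数的原因.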

    众所周知,卷积网络能够通过融合各层局部感受野中的空间和信道信息来构造信息特征. 压缩激励网络(SENet)[26]的本质是注意力机制在卷积网络领域中的应用,作为一个简化的结构, 它可以插入卷积网络中,并对通道关系之间的相互依赖性加以关注,此前在视觉等众多领域其适用性都得到了验证[41-42],这种子结构也适用于语音场景[43]. 具体而言,通过对卷积的每个输出通道,预测一个常数权重,并对该通道加权. 这种注意力机制以轻量级的计算代价让模型更偏向信号的最具信息性部分.
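压缩激励的“压缩—激励”两步可以用如下纯Python代码粗略示意(仅为结构示意:全连接权重为随机假设值,通道数与隐层维度也是假设值,并非文中实际网络):

```python
import math
import random

def se_weights(features, hidden_dim):
    """压缩激励示意:features 为 [通道数][时间帧] 的二维列表,返回每个通道的评价分数(0~1)."""
    c = len(features)
    # 压缩:对每个通道做全局平均池化,将全局信息聚合为一个信道描述符
    squeezed = [sum(ch) / len(ch) for ch in features]
    random.seed(0)  # 权重随机初始化,仅作结构示意
    w1 = [[random.uniform(-0.1, 0.1) for _ in range(c)] for _ in range(hidden_dim)]
    w2 = [[random.uniform(-0.1, 0.1) for _ in range(hidden_dim)] for _ in range(c)]
    # 激励:全连接 -> Relu -> 全连接 -> Sigmoid,学习通道间的非线性交互
    h = [max(0.0, sum(w1[j][i] * squeezed[i] for i in range(c))) for j in range(hidden_dim)]
    scores = [1 / (1 + math.exp(-sum(w2[i][j] * h[j] for j in range(hidden_dim)))) for i in range(c)]
    return scores

feats = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0]]  # 假设3个通道、2个时间帧
scores = se_weights(feats, hidden_dim=2)
print(len(scores), all(0.0 < s < 1.0 for s in scores))  # 每个通道得到一个(0,1)内的权重
```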

    在语音编解码器的量化环节中,编码器提取的特征需要被压缩,矢量量化是常用的方法之一[22]. 通过建立一张码本,将连续的特征空间转化为离散的token. 采用的码本数目越多,解码器恢复的语音质量就越高,但更多的码本也会消耗传输时的网络带宽. 因此,对于神经编解码器,不同的网络带宽环境需要专门训练各自数目的码本,这大大增加了使用和训练的成本. 对此,人们设计了残差矢量量化结构 (residual vector quantization,RVQ)[11],如图1所示. 该结构将上一量化层 {q}_{i}(\cdot) 的输入 {y}_{i} 与其离散化输出之间的残差作为下一量化层 {q}_{i+1}(\cdot) 的输入,最终累加各量化层的离散化输出,作为输入连续特征 x 的重建值 \hat{x} ,即

    图  1  残差矢量量化结构
    Figure  1.  The structure of residual vector quantization
    {y}_{1}=x (1)
    {y}_{i+1}={y}_{i}-{q}_{i}\left({y}_{i}\right) (2)
    \hat{x}=\sum _{i=1}^{N}{q}_{i}\left({y}_{i}\right) (3)

    其中量化层 {q}_{i}(\cdot) 是在码本空间中选择距离最近的条目,由于传输时仅记录条目的序号,从而实现信息压缩的效果. 上述过程中,残差堆积层数 N 越多,离散化精度也越高. 此外,残差结构具有相当的灵活性:可以一次训练较多层码本,而推理时根据带宽限制只选用前几层( n\le N )进行累积重建,大大拓宽了应用范围.
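上述残差矢量量化的编码与重建过程可以用如下Python代码示意(最小示意:码本与输入均为假设的二维小例子,量化即按欧氏距离平方选取最近条目):

```python
def quantize(frame, codebook):
    """在码本中选择距离最近的条目,返回(序号, 条目)."""
    best = min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(frame, codebook[k])))
    return best, codebook[best]

def rvq_encode(x, codebooks):
    """残差矢量量化:逐层量化上一层的残差,返回各层的token序号."""
    tokens, residual = [], list(x)
    for cb in codebooks:
        idx, entry = quantize(residual, cb)
        tokens.append(idx)
        residual = [r - e for r, e in zip(residual, entry)]  # y_{i+1} = y_i - q_i(y_i)
    return tokens

def rvq_decode(tokens, codebooks, n=None):
    """按带宽限制只取前n层(n<=N)的离散化输出累加重建."""
    n = len(tokens) if n is None else n
    recon = [0.0] * len(codebooks[0][0])
    for idx, cb in zip(tokens[:n], codebooks[:n]):
        recon = [r + e for r, e in zip(recon, cb[idx])]
    return recon

codebooks = [[[0.0, 0.0], [1.0, 1.0]],      # 第1层码本(假设值)
             [[0.0, 0.0], [0.25, -0.25]]]   # 第2层码本,量化第1层的残差
x = [1.2, 0.8]
tokens = rvq_encode(x, codebooks)
print(tokens, rvq_decode(tokens, codebooks))  # [1, 1] [1.25, 0.75],逐层逼近输入
```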

    为了加强码本空间的表示能力,可以将同一量化层的码本拆分成多组,即分组残差矢量量化(group residual vector quantization,GRVQ)[29],进一步深化了量化能力.

    本文工作中将使用压缩激励机制对量化部分进行优化.

    本文采取的系统整体框架如图2所示,生成部分分为编码器、量化器和解码器共3个组件,此外还包括2个鉴别器部分. 语音波形提取梅尔谱后,输入卷积编码器. 卷积编码器遵循与 Hificodec 相似的结构[29],但采用较少的卷积模块数目,即由首尾的1维卷积层和中间的卷积单元组成. 每个卷积单元由3个残差单元[44]和一个下采样层组成[11]. 解码器则由首尾的1维卷积层和中间4个卷积单元组成[11].

    图  2  基于梅尔谱与压缩激励加权量化的语音神经编解码方法结构图
    Figure  2.  Structure diagram of neural speech codec method based on Mel spectrogram and squeeze-excitation-weighted quantization

    量化器接受编码器的输出特征和压缩激励权重. 在编码器压缩率给定的情况下,灵活调节码本数目实现不同比特率下的调节. 在比特率相对较高(需要4个及以上的码本数目)时,采取 Hificodec 的策略,即分组残差矢量量化(group residual vector quantization,GRVQ)方法[29];为了训练的稳定,组数设置为2. 而码本数目低于4个时,则取消分组,退化为普通残差矢量量化.

    对于鉴别器,本实验参考声码器 HifiGAN的策略,采用对语音波形直接检验的多尺度鉴别器(multi-scale discriminator,MSD)与多周期鉴别器(multi-period discriminator,MPD)[45]. 前者通过核大小为4的步幅平均池化层(strided average pooling)将波形序列分别下采样2倍和4倍后,对所得语音序列进行判别. 后者将长度为 T 的一维原始语音序列按照固定周期 p 重塑为长为 p 、宽为 T/p 的2维数据,然后对重塑后的数据应用2维卷积. 鉴别器最终输出判定情况与中间层检测时的特征图,以用于后续生成对抗损失和特征损失计算.
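多周期鉴别器对输入的周期重塑可以示意如下(最小示意:真实实现中不足一个周期的尾部通常做填充,此处直接截断):

```python
def period_reshape(sequence, p):
    """将长度为T的一维序列按固定周期p重塑为T/p行、每行p个点的二维数据
    (多周期鉴别器MPD的输入形式示意)."""
    rows = len(sequence) // p
    return [sequence[i * p:(i + 1) * p] for i in range(rows)]

seq = list(range(12))  # 假设的长度为12的一维序列
print(period_reshape(seq, 3))  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 10, 11]]
```

重塑后,周期为 p 的结构在二维数据的同一列上对齐,二维卷积即可捕捉跨周期的模式.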

    编解码系统中的压缩率是从波形采样点到进入量化器前时间帧的下采样倍数[27]. 由于本文采用梅尔谱特征,因此帧移(hop_size)本身具有压缩的作用,编码器卷积单元中下采样层卷积核的采样步幅(stride)则承担了剩余的压缩效果;解码器的所有上采样倍数则均来自卷积核的采样步幅. 因此需要满足:

    r=hop\_size\times \prod _{i}{S}_{enc,i}=\prod _{i}{S}_{dec,i}\text{,} (4)

    其中 r 是压缩率, {S}_{enc,i} 和 {S}_{dec,i} 分别指编码器和解码器第 i 个卷积单元中采样卷积核的步幅.

    式(4)的成立使得编码器端所需的卷积单元数量相较于直接从波形中进行下采样时减少. 尽管计算梅尔谱图的过程包含短时傅里叶变换以及与梅尔滤波器相乘的运算,这也会带来一定的时间消耗,但其计算量低于具有相同压缩程度的卷积单元所进行的卷积运算,后续的实验证明了这一点.
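式(4)的约束可以用文中的实际配置(hop_size为160,编码器下采样步幅[2],解码器上采样步幅[8, 5, 4, 2])作一个简单校验:

```python
from math import prod

# 校验式(4):压缩率 r = hop_size × 编码器各下采样步幅之积 = 解码器各上采样步幅之积
hop_size = 160
enc_strides = [2]           # 文中编码器唯一下采样卷积单元的步幅
dec_strides = [8, 5, 4, 2]  # 文中解码器各卷积单元的上采样倍数
r = hop_size * prod(enc_strides)
print(r, prod(dec_strides))  # 320 320,两侧相等,满足式(4)
```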

    对于量化过程中的压缩激励权重,本文希望通过动态地分配通道维度的权重来提高网络表达能力. 具体而言,我们对卷积编码器的最后一层进行全局平均池化,将全局空间信息聚合到一个信道描述符中,即压缩过程;继而先后通过2个全连接层及中间的Relu层,再接Sigmoid函数,该步骤能学习被挤压通道之间的非线性交互,即激励过程[26]. 最终得到与编码器输出通道数 {C}_{enc} 等长的一维向量作为各通道评价分数 [Scor{e}_{1},Scor{e}_{2},… ,Scor{e}_{{C}_{enc}}] . 如图2所示,作为通道注意力权重,该分数并非乘在编码器最后一层中,而是保留进入量化器模块. 尽管分组残差矢量量化(GRVQ)中的分组操作使得量化时实际的码本嵌入宽度与待查询的编码器输出通道数 {C}_{enc} 并不相等,但每轮残差后各个分组码本会合并,合并后的总码本嵌入宽度 W 等于 {C}_{enc} . 某帧 [{x}_{1},{x}_{2},… ,{x}_{{C}_{enc}}] 查询码本时,需要计算该帧与码本中各条目 [{c}_{1},{c}_{2},… ,{c}_{W}] 的距离并选取最近的条目,我们定义该距离为

    d=\sum _{i=1}^{W}Scor{e}_{i}\times {({x}_{i}-{c}_{i})}^{2}. (5)
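式(5)定义的加权距离查表过程可以示意如下(码本、帧与权重均为假设的小例子,用于说明权重如何改变最近条目的选择):

```python
def weighted_nearest(frame, codebook, scores):
    """按式(5)的加权距离 d = Σ Score_i·(x_i - c_i)^2 选取最近的码本条目序号."""
    def dist(entry):
        return sum(s * (x - c) ** 2 for s, x, c in zip(scores, frame, entry))
    return min(range(len(codebook)), key=lambda k: dist(codebook[k]))

codebook = [[0.0, 1.0], [1.0, 0.0]]  # 假设的2个码本条目
frame = [0.6, 0.8]                   # 假设的待查帧
print(weighted_nearest(frame, codebook, [1.0, 1.0]))  # 等权时距离为0.40和0.80,选条目0
print(weighted_nearest(frame, codebook, [5.0, 1.0]))  # 加大第1维权重后为1.84和1.44,选条目1
```

可见当某一通道被压缩激励判定为更重要时,该通道上的偏差在距离中被放大,量化结果随之改变.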

    本实验整体框架采用GAN结构,分为鉴别器损失和生成器损失2部分[46].

    对于包含多个子鉴别器的鉴别器组, K 是鉴别器数目, {D}_{i} 代表多周期鉴别器MPD或多尺度鉴别器MSD的第 i 个子鉴别器, x 为原始波形样本, \hat{x} 即生成器输出的波形,定义鉴别器的对抗损失:

    {L}_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{D}}}=\frac{1}{K}\sum _{i=1}^{K}[{({D}_{i}\left(x\right)-1)}^{2}+{D}_{i}{\left(\hat{x}\right)}^{2}]. (6)

    同时,生成器的对抗损失:

    {L}_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{G}}}=\frac{1}{K}\sum _{i=1}^{K}{({D}_{i}\left(\hat{x}\right)-1)}^{2}. (7)

    定义生成部分的重建波形的梅尔谱图 \hat{m} 与原始样本梅尔谱图 m 之间的L1距离为重建损失(reconstruction loss) {L}_{\mathrm{r}\mathrm{e}\mathrm{c}} .

    {L}_{\mathrm{r}\mathrm{e}\mathrm{c}}={\|m-\hat{m}\|}_{1}. (8)

    此外,对于 {D}_{i} 鉴别器的第 l 个中间层, w 为其输入的中间特征, \hat{w} 为该层输出,我们定义总体的特征损失(feature loss) {L}_{\mathrm{f}\mathrm{e}\mathrm{a}\mathrm{t}} [27].

    {L}_{\mathrm{f}\mathrm{e}\mathrm{a}\mathrm{t}}=\frac{1}{KL}\sum _{i=1}^{K}\sum _{l=1}^{L}\frac{\|{D}_{i}^{l}\left(w\right)-{D}_{i}^{l}\left(\hat{w}\right)\|}{\mathrm{m}\mathrm{e}\mathrm{a}\mathrm{n}\left(\|{D}_{i}^{l}\left(w\right)\|\right)}. (9)

    量化过程中,第 n 组第 i 个量化器 {q}_{n,i} ,对于其待查帧 {z}_{n,i} ,定义承诺损失(commitment loss) {L}_{\mathrm{c}} [27].

    {L}_{c}=\sum _{n,i}{\|{z}_{n,i}-{q}_{n,i}({z}_{n,i})\|}_{2}^{2}. (10)

    综上所述,整体损失定义为

    {L}_{\mathrm{t}\mathrm{o}\mathrm{t}\mathrm{a}\mathrm{l}}={\lambda }_{\mathrm{f}\mathrm{e}\mathrm{a}\mathrm{t}}{L}_{\mathrm{f}\mathrm{e}\mathrm{a}\mathrm{t}}+{\lambda }_{\mathrm{c}}{L}_{\mathrm{c}}+{\lambda }_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{G}}}{L}_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{G}}}+{\lambda }_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{D}}}{L}_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{D}}}+{\lambda }_{\mathrm{r}\mathrm{e}\mathrm{c}}{L}_{\mathrm{r}\mathrm{e}\mathrm{c}}\text{,} (11)

    其中 {\lambda }_{\mathrm{f}\mathrm{e}\mathrm{a}\mathrm{t}} , {\lambda }_{\mathrm{c}} , {\lambda }_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{G}}} , {\lambda }_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{D}}} 和 {\lambda }_{\mathrm{r}\mathrm{e}\mathrm{c}} 是超参数.
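式(11)的加权求和可以用一小段代码示意(超参数取文中的设置,各项损失值为假设值):

```python
def total_loss(losses, lambdas):
    """式(11):各项损失乘以对应超参数后求和."""
    return sum(lambdas[k] * losses[k] for k in losses)

lambdas = {"feat": 1, "c": 10, "adv_G": 1, "adv_D": 1, "rec": 45}          # 文中设置
losses = {"feat": 0.2, "c": 0.05, "adv_G": 0.3, "adv_D": 0.4, "rec": 0.1}  # 假设的损失值
print(total_loss(losses, lambdas))  # ≈ 0.2 + 0.5 + 0.3 + 0.4 + 4.5 = 5.9
```

可以看到,重建损失的权重(45)远大于其余各项,训练更偏向梅尔谱的逐点重建精度.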

    本实验采用的训练数据集是LibriTTS英文多说话人数据集[34],该数据集包含2 456位说话人的24 kHz语音,总计持续时长为585 h. 我们的评估数据集源自VCTK[35],该数据集与LibriTTS一样同属英文多说话人数据集,包含110位英语说话人录制的48 kHz语音. 我们从中随机抽取100条语音,并将其降采样为24 kHz.

    由于神经编解码器的压缩率、码本数量和其实现的比特率之间相互制约,确定其中2个参数后,剩余的参数也随之确定. 码本数量和压缩率均将影响模型的训练过程. 为了通过调节码本数目比较各比特率下的性能,主实验中将压缩率 r 统一设置为320,梅尔谱图频域维度为80,hop_size设为160;对于压缩率的影响,后续亦将设计消融探究.

    此外,编码器卷积单元下采样倍数设置为[2]. 解码器卷积单元上采样倍数逐个设置为[8, 5, 4, 2]. 以编码器唯一的下采样卷积单元为例,其网络参数如表1所示:Conv 1进行下采样,此外,Conv 2~3,Conv 4~5,…,Conv 18~19等相邻2层卷积分别组成残差连接结构.

    表  1  卷积单元网络结构
    Table  1.  Network Architecture of Convolutional Unit
    卷积单元 卷积核/步长/空洞数
    Conv 1 4/2/1
    Conv 2 11/1/1
    Conv 3 11/1/1
    Conv 4 11/1/3
    Conv 5 11/1/1
    Conv 6 11/1/5
    Conv 7 11/1/1
    Group Norm
    Conv 8 7/1/1
    Conv 9 7/1/1
    Conv 10 7/1/3
    Conv 11 7/1/1
    Conv 12 7/1/5
    Conv 13 7/1/1
    Group Norm
    Conv 14 3/1/1
    Conv 15 3/1/1
    Conv 16 3/1/3
    Conv 17 3/1/1
    Conv 18 3/1/5
    Conv 19 3/1/1
    Group Norm

    训练时,批大小取16,学习率为0.000 2. 超参数 {\lambda }_{\mathrm{f}\mathrm{e}\mathrm{a}\mathrm{t}} , {\lambda }_{\mathrm{c}} , {\lambda }_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{G}}} , {\lambda }_{\mathrm{a}\mathrm{d}{\mathrm{v}}_{\mathrm{D}}} , {\lambda }_{\mathrm{r}\mathrm{e}\mathrm{c}} 设置为1,10,1,1,45. 编码器输出特征为512维,单个码本的索引范围为[0~1 023]. 我们的模型和基线进行了比特率分别为0.75 Kbps,1.5 Kbps,3 Kbps,4.5 Kbps,6 Kbps的实验,分别对应1,2,4,6,8个码本. 所有实验均在单卡2080Ti上训练了25个epoch,测试时统一采用CPU运行(Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40 GHz). 本实验采取Hificodec[29],Encodec[27],SoundStream[11]模型作为基线,其中Encodec和SoundStream采用Yang等人[29]在Hificodec工作中重新实现的代码,见https://github.com/yangdongchao/AcademiCodec. 除了上述神经编解码器,本实验也与当前广泛应用的Opus进行了比较.

    本实验客观指标采用实时率(RTF)衡量时延,其定义为输入语音的时长与编码所需时间的比率[11]. 因此,该指标与编码器的运行速度正相关. 评测中,无论是本文方法还是基线,编码时间均包含了从波形输入到生成编码索引的整个过程,这也包括了波形到梅尔谱图的转换时间.
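按上述定义,实时率的计算可以示意如下(时长与耗时均为假设值):

```python
def rtf(audio_seconds: float, encode_seconds: float) -> float:
    """实时率RTF = 输入语音时长 / 编码所需时间,数值越大表示编码越快."""
    return audio_seconds / encode_seconds

# 假设编码10 s语音共耗时0.4 s(含波形到梅尔谱图的转换)
print(rtf(10.0, 0.4))  # 25.0,即编码速度为实时的25倍
```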

    本实验采用短时客观可懂度(STOI)[36]、虚拟语音质量客观评估(VISQOL)[37]以及WARP-Q[47]评价语音的感知质量. 其中WARP-Q是专门针对神经语音编解码器评测提出的指标,本文采用其原始评分值进行比较. 本实验主观指标采用MOS评分,由5位志愿者对100条随机语音按1~5分打分完成评测.

    本文的所有测试实验和消融实验都在VCTK上进行. 采用训练之外的同类别数据集进行测试的实验模式不仅符合实际的应用场景,也对模型的泛化能力要求更高. 与各神经编解码器的比较结果如表2所示,可见本文方法在时间指标,即编码器端的实时率RTF上,相对所有同类神经编解码器基线在所测的任意比特率下都具有优势. 以基线中平均感知质量较好的Hificodec为比较基准,比特率1.5 Kbps时编码实时率提升幅度最大,最多可提升4.6倍;而比特率3 Kbps时提升幅度较小,但仍达到了3.15倍. 对于语音感知质量指标STOI,WARP-Q,VISQOL,本文方法在比特率较低时(0.75 Kbps,1.5 Kbps,3.0 Kbps)表现良好,能超过所有基线模型,但比特率较高时感知质量有所不足. 以极低比特率为例,0.75 Kbps条件下,相较最好的Encodec,客观感知质量平均提升8.72%;但比特率为6 Kbps时,相较Hificodec,STOI和VISQOL指标已经低于基线,3种客观指标平均降低了0.46%. 主观MOS的测定结果也进一步支持了该结论. 从残差矢量量化的原理分析,所用的码本越多,残差层数就越深,但除去第1层接收的是编码器输出特征以外,后续各层接收的是前层量化的残差. 因此,由编码器计算的压缩激励权重所反映的通道之间的相关性和重要性,对后续残差层的效用有所削弱,甚至不再利于量化;后续的消融实验也进一步证实了这点,即对于重建语音感知质量,本文方法在较低比特率时更为有利,而在较高比特率时失效.

    表  2  各神经编解码器在VCTK数据集上的评价指标
    Table  2.  Metrics Evaluated on the VCTK Dataset for Each Neural Codec
    比特率/Kbps 码本数 模型 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} STOI ↑ WARP-Q ↓ VISQOL ↑ MOS ↑
    0.75 1 SoundStream 18.817 0.703 2.988 2.134 1.80
    0.75 1 Encodec 19.188 0.736 2.917 2.226 2.28
    0.75 1 Hificodec 5.506 0.634 3.024 2.028 1.88
    0.75 1 本文模型 36.656 0.747 2.604 2.536 2.40
    1.5 2 SoundStream 15.438 0.736 2.860 2.315 2.52
    1.5 2 Encodec 18.923 0.777 2.694 2.492 3.20
    1.5 2 Hificodec 6.507 0.751 2.709 2.520 2.84
    1.5 2 本文模型 36.447 0.801 2.371 2.958 3.62
    3 4 SoundStream 15.760 0.757 2.798 2.407 2.36
    3 4 Encodec 18.550 0.790 2.625 2.584 3.16
    3 4 Hificodec 6.549 0.839 2.125 3.311 3.52
    3 4 本文模型 35.632 0.846 1.997 3.334 3.88
    4.5 6 SoundStream 20.036 0.790 2.711 2.651 3.16
    4.5 6 Encodec 22.119 0.809 2.535 2.758 3.40
    4.5 6 Hificodec 6.562 0.862 1.912 3.562 3.62
    4.5 6 本文模型 22.338 0.859 1.798 3.492 3.94
    6 8 SoundStream 20.145 0.799 2.502 2.694 3.50
    6 8 Encodec 21.423 0.818 1.852 2.799 3.44
    6 8 Hificodec 5.673 0.878 1.852 3.665 3.92
    6 8 本文模型 23.493 0.865 1.776 3.552 3.98

    此外,表3展示了本文所提出的神经编解码器与目前实际场景中广泛使用的通用编解码方法Opus的对比结果. 由于Opus的最低支持带宽为6 Kbps,并且其帧大小为2.5 ms,5 ms,10 ms等固定值,因此该比较实验存在一定的局限. 所有编解码器均在6 Kbps的带宽条件下进行测试,神经编解码器采用8码本,压缩率取320倍. 结果显示,Opus在STOI指标上具有一些优势.

    表  3  本文方法与通用编解码器Opus的评测对比
    Table  3.  Evaluation Comparison of the Proposed Method and the General Codec Opus
    模型 STOI ↑ WARP-Q ↓ VISQOL ↑ MOS ↑
    Opus-10ms 0.703 2.610 2.476 3.32
    Opus-5.0ms 0.858 2.383 3.037 3.58
    Opus-2.5ms 0.956 1.976 3.475 3.86
    本文模型 0.865 1.776 3.552 3.98

    本实验中原始语音和重建语音的梅尔谱图对比如图3所示,以6 Kbps和0.75 Kbps为例,分别代表高比特率与低比特率场景. 将图3(b)(c)、图3(d)(e)与图3(a)相比较,本文方法的重建语音梅尔谱图均与原始语音更为贴近,这意味着更相似的听感体验.

    图  3  梅尔谱图对比
    Figure  3.  Comparison of Mel spectrogram

    为了研究压缩率对所提出编解码器音质的影响,我们在比特率为3 Kbps和6 Kbps的条件下进行了消融实验. 实验中,较高的压缩率需要增加码本的数目. 具体的实验配置如表4所示,实验结果见表5. 在3 Kbps条件下,随着压缩率-码本数的增加,语音质量指标均有所提升. 然而,在6 Kbps条件下,尽管压缩率和码本数增加,音质和语音可懂度却出现下降. 这表明在更高的比特率中,扩大压缩率导致的码本数目增大并不总是对模型的训练和最终性能产生积极的影响. 这一结果强调了在不同比特率下,需要谨慎选择压缩率和码本数.

    表  4  不同压缩率的实验配置
    Table  4.  Experimental Configuration of Different Compression Rates
    压缩率 编码器下采样倍数 解码器上采样倍数 解码器各层卷积核
    180 [2] [5, 4, 4, 2] [10, 8, 8, 4]
    320 [2] [8, 5, 3, 2] [16, 11, 7, 4]
    640 [2] [10, 8, 4, 2] [20, 16, 8, 4]
    表  5  对于不同压缩率和码本数量的消融实验
    Table  5.  Ablation Experiments on Different Compression Ratios and Codebook Count
    比特率/Kbps 压缩率 码本数 STOI ↑ WARP-Q ↓ VISQOL ↑
    3 180 2 0.843 2.028 3.314
    3 320 4 0.846 1.997 3.334
    3 640 8 0.854 1.852 3.427
    6 180 4 0.898 1.608 3.786
    6 320 8 0.865 1.776 3.552
    6 640 16 0.835 1.848 3.387

    为了探究压缩激励方法的有效性和适用边界,我们在1.5 Kbps,3 Kbps,6 Kbps这3个比特率档次上进行了消融实验. 为了进一步对比,我们将输入特征分别设为梅尔谱图和波形进行消融探究,其中波形实验仿照Hificodec模型并恢复了多层卷积编码器结构. 最终结果分别如表6和表7所示,图4和图5分别对表6和表7中音频质量的各项指标进行了可视化.

    表  6  压缩激励加权机制的消融实验(输入特征为梅尔谱图)
    Table  6.  Ablation Experiments on Squeeze-Excitation-Weighted Mechanism (Input Characteristic is Mel Spectrogram)
    比特率/Kbps 码本数 指标 Mel-Input+SE-weight(本文) Mel-Input
    1.5 2 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 36.447 42.515
    STOI ↑ 0.801 0.795
    WARP-Q ↓ 2.371 2.318
    VISQOL ↑ 2.958 2.926
    3.0 4 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 35.632 44.498
    STOI ↑ 0.846 0.837
    WARP-Q ↓ 1.997 2.035
    VISQOL ↑ 3.334 3.310
    6.0 8 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 23.493 32.345
    STOI ↑ 0.865 0.866
    WARP-Q ↓ 1.776 1.733
    VISQOL ↑ 3.552 3.499
    表  7  关于压缩激励加权机制的消融实验(输入特征为波形)
    Table  7.  Ablation Experiments on Squeeze-Excitation-Weighted Mechanism (Input Characteristic is Waveform)
    比特率/Kbps 码本数 指标 Wave-Input+SE-weight Wave-Input
    1.5 2 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 5.527 6.507
    STOI ↑ 0.753 0.751
    WARP-Q ↓ 2.526 2.709
    VISQOL ↑ 2.630 2.520
    3.0 4 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 5.702 6.549
    STOI ↑ 0.837 0.839
    WARP-Q ↓ 2.147 2.125
    VISQOL ↑ 3.329 3.311
    6.0 8 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 5.369 5.673
    STOI ↑ 0.871 0.878
    WARP-Q ↓ 2.020 1.852
    VISQOL ↑ 3.477 3.665
    图  4  关于压缩激励加权机制的消融实验(输入特征为梅尔谱图)
    Figure  4.  Ablation experiments on squeeze-excitation-weighted mechanism (Input Characteristic is Mel spectrogram )
    图  5  关于压缩激励加权机制的消融实验(输入特征为波形)
    Figure  5.  Ablation experiments on squeeze-excitation-weighted mechanism (Input Characteristic is waveform)

    上述图表显示:梅尔谱特征下,压缩激励权重的方法在1.5 Kbps和3 Kbps的比特率下表现出一定的有效性,但在6 Kbps时效果开始减弱;而在波形特征下,该方法仅在1.5 Kbps的极低比特率下具有一定的效果,在3 Kbps时其效果已显著降低. 同时,表6和表7均显示:压缩激励方法会引发轻微的时延损耗,因此实际使用中应结合带宽环境进行取舍. 此外,我们可以认为,相较波形特征而言,梅尔谱特征作为输入与压缩激励权重方法更为适配,大多数情形下,这样的组合在速度和质量上更具优势. 该现象也说明:梅尔卷积编码器各个输出通道信息量之间的差异性大于波形卷积编码器. 这是符合认知的,因为梅尔谱频率维度之间本身就具有显著差异:人声语音的能量更多地集中在低频区域,且人耳对不同频带的感知不同,例如对低频感知更加敏锐,对高频信息感知得较为模糊. 这样的差异性更能发挥压缩激励权重方法在低比特率时的潜能.

    本节旨在讨论激活函数对梅尔谱编解码器语音感知质量与时延的影响. 本节实验将在梅尔谱编解码器中进行Relu激活函数与Snake激活函数的对比. Snake函数作为一种具有周期性特征的激活函数,能够有效地适应语音波形高周期性的性质,在声码器和波形编解码器任务中已被证实能显著提升语音质量[38-39]. 消融实验结果如表8所示:对于梅尔谱编解码器,Relu激活函数能在语音质量与Snake激活函数相当的情况下,在运行速度上取得明显提升. 尤其是在比特率较低的情况下,以0.75 Kbps和1.5 Kbps为例,Relu激活函数和Snake激活函数在语音质量上几乎没有差距,而实时率(RTF)分别能提高2.16倍和2.09倍,因此选择Relu激活函数更适合实际应用需求.

    表  8  关于激活函数的消融实验
    Table  8.  Ablation Experiments on Activation Function
    比特率/Kbps 码本数 指标 Relu(本文) Snake
    0.75 1 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 36.656 11.590
    STOI ↑ 0.747 0.746
    WARP-Q ↓ 2.604 2.568
    VISQOL ↑ 2.536 2.538
    1.50 2 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 36.447 11.784
    STOI ↑ 0.801 0.810
    WARP-Q ↓ 2.371 2.224
    VISQOL ↑ 2.958 3.073
    3.00 4 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 35.632 10.821
    STOI ↑ 0.846 0.851
    WARP-Q ↓ 1.997 1.989
    VISQOL ↑ 3.334 3.234
    4.50 6 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 22.338 5.915
    STOI ↑ 0.859 0.892
    WARP-Q ↓ 1.798 1.563
    VISQOL ↑ 3.492 3.638
    6.00 8 {\mathrm{R}\mathrm{T}\mathrm{F}}_{\mathrm{e}\mathrm{n}\mathrm{c}} 23.493 4.967
    STOI ↑ 0.865 0.900
    WARP-Q ↓ 1.776 1.448
    VISQOL ↑ 3.552 3.658

    本文采用梅尔谱作为编码器输入特征,结合低层卷积编码器,并通过压缩激励的方法利用梅尔谱图经过低层编码器后输出特征各通道信息量的不均匀性. 本文在LibriTTS和VCTK数据集上进行实验,结果表明,该方法在提升编码器端运行速度上具有优势,减少了时延;此外,在较低比特率的场景中,重建的语音相比波形编解码器基线具有更好的感知质量. 通过消融实验,本文探究了压缩激励权重方法在不同带宽和不同输入特征下的适用情况,进一步确定了压缩激励权重更适合低比特率条件的结论. 此外,本文还对编解码器的激活函数进行了消融探究,所采用的Relu激活函数相比周期性的Snake激活函数在运行速度上更具突出优势. 在未来,可以根据对梅尔谱图的先验知识,设计更高效的、结合其特点的编解码系统.

    作者贡献声明:周俊佐提出了算法思路、完成实验并撰写论文;易江燕提出指导意见并修改论文;陶建华提供了实验监督和项目管理;任勇实现了方法设计和实验设计;汪涛负责方法设计和实验设计,初稿修改.

  • 图  1   深度学习推荐模型系统架构

    Figure  1.   Architecture of deep learning recommendation model system

    图  2   深度学习推荐模型训练过程中的未命中拉取、更新推送和逐出推送操作

    Figure  2.   Miss Pull, Update Push, and Evict Push transmission operations in DLRM training

    图  3   ESD中的嵌入样本调度过程概览

    Figure  3.   Overview of embedding samples dispatching process in ESD

    图  4   匈牙利算法流程图

    Figure  4.   Flow chart of Hungarian algorithm

    图  5   总体性能

    Figure  5.   Overall performance

    图  6   命中率和传输操作组成

    Figure  6.   Hit ratio and ingredient of transmission operations

    图  7   代价降低和GPU资源消耗

    Figure  7.   Cost reduction and GPU resource consumption

    图  8   每个工作节点批量大小对性能的影响

    Figure  8.   Impact of batch size per worker on performance

    图  9   缓存比例对性能的影响

    Figure  9.   Impact of cache ratio on performance

    图  10   嵌入大小对性能的影响

    Figure  10.   Impact of embedding size on performance

    图  11   当使用4个工作节点时的实验结果

    Figure  11.   Experiment results when using four workers

    表  1   符号列表

    Table  1   List of Symbols

    符号 描述
    \mathcal{W} 边缘工作节点集合
    {\mathcal{E}}_{i} 迭代 {I}_{i} 的输入嵌入样本, {\mathcal{E}}_{i}=\{{E}_{1},{E}_{2},… ,{E}_{m\times n}\}
    {I}_{i} i 个训练迭代
    m 每个工作节点的批量大小
    {E}_{i} 一个嵌入样本, {E}_{i}=\{{x}_{1},{x}_{2},… \}
    {x}_{i} 一个嵌入样本的ID
    \boldsymbol{E}\boldsymbol{m}\boldsymbol{b}({{x}}_{{i}}) 嵌入样本ID为 {x}_{i} 对应的嵌入值(向量)
    {D}_{\rm tran} 一个嵌入的数据量
    {B}_{w}^{j} 工作节点 {w}_{j} 和参数服务器间的网络带宽
    {T}_{\rm tran}^{j} 工作节点 {w}_{j} 和参数服务器间传输一个嵌入的代价, {T}_{\rm tran}^{j}=\dfrac{{D}_{\rm tran}}{{B}_{w}^{j}}
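表1中单个嵌入的传输代价 {T}_{\rm tran}^{j}={D}_{\rm tran}/{B}_{w}^{j} 可以用一个小例子示意(嵌入维度、每维字节数与两个异构节点的带宽均为假设值):

```python
def per_embedding_cost(d_tran: float, bandwidth: float) -> float:
    """工作节点w_j与参数服务器间传输一个嵌入的代价 T_tran^j = D_tran / B_w^j."""
    return d_tran / bandwidth

d_tran = 128 * 4  # 假设嵌入维度为128、每维4字节,则单个嵌入的数据量为512 B
slow, fast = 1_000_000, 4_000_000  # 假设的两个异构节点带宽(B/s)
print(per_embedding_cost(d_tran, slow), per_embedding_cost(d_tran, fast))
```

同一嵌入在带宽不同的节点上传输代价相差数倍,这正是ESD按预期传输代价调度样本的动机.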

    表  2   在使用8个工作节点时,不同批量大小下串行与并行实现的匈牙利算法执行时间

    Table  2   Execution Time Between Serial and Parallel Implementations of Hungarian Algorithm for Different Batch Size when Using 8 Workers (ms)

    每节点批量大小 32 64 128 256 512 1024
    CPU串行 9 62 528 3360 50976 134986
    GPU并行 2 12 88 218 681 11385

    表  3   实验所用负载

    Table  3   Workloads in Experiment

    负载序号 所用模型 数据集
    S1 WDL[26] Criteo Kaggle
    S2 DFM[27] Avazu
    S3 DCN[68] Criteo Sponsored Search
  • [1]

    Gu Yulong, Bao Wentian, Ou Dan, et al. Self-supervised learning on users’ spontaneous behaviors for multi-scenario ranking in e-commerce[C]//Proc of the 30th ACM Int Conf on Information & Knowledge Management. New York: ACM, 2021: 3828–3837

    [2]

    Wang Jizhe, Huang Pipei, Zhao Huan, et al. Billion-scale commodity embedding for e-commerce recommendation in Alibaba[C]//Proc of the 24th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining. New York: ACM, 2018: 839–848

    [3]

    Smith B, Linden G. Two decades of recommender systems at Amazon.com[J]. IEEE Internet Computing, 2017, 21(3): 12−18 doi: 10.1109/MIC.2017.72

    [4]

    Gomez-Uribe C, Hunt N. The Netflix recommender system: Algorithms, business value, and innovation[J]. ACM Transactions on Management Information Systems, 2015, 6(4): 1−19

    [5]

    Covington P, Adams J, Sargin E. Deep neural networks for Youtube recommendations[C]//Proc of the 10th ACM Conf on Recommender Systems. New York: ACM, 2016: 191−198

    [6]

    Schedl M, Knees P, Gouyon F. New paths in music recommender systems research[C]//Proc of the 11th ACM Conf on Recommender Systems. New York: ACM, 2017: 392−393

    [7]

    Sharma A, Jiang J, Bommannavar P, et al. Graphjet: Real-time content recommendations at Twitter[J]. Proceedings of the VLDB Endowment, 2016, 9(13): 1281−1292 doi: 10.14778/3007263.3007267

    [8]

    Boeker M, Urman A. An empirical investigation of personalization factors on Tiktok[C]//Proc of the ACM Web Conf 2022. New York: ACM, 2022: 2298−2309

    [9]

    Ying R, He Ruining, Chen Kaifeng, et al. Graph convolutional neural networks for web-scale recommender systems[C]//Proc of the 24th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining. New York: ACM, 2018: 974−983

    [10] 彭迎涛,孟小峰,杜治娟. 多样化推荐综述[J]. 计算机研究与发展,2025,62(2):285−313 doi: 10.7544/issn1000-1239.202330600

    Peng Yingtao, Meng Xiaofeng, Du Zhijuan. Survey on diversified recommendation[J]. Journal of Computer Research and Development, 2025, 62(2): 285−313 (in Chinese) doi: 10.7544/issn1000-1239.202330600

    [11]

    Wang Siqi, Feng Tianyu, Yang Hailong, et al. Atrec: Accelerating recommendation model training on CPUs[J]. IEEE Transactions on Parallel and Distributed Systems, 2024, 35(6): 905−918 doi: 10.1109/TPDS.2024.3381186

    [12]

    Sayed A, Himeur Y, Alsalemi A, et al. Intelligent edge-based recommender system for Internet of energy applications[J]. IEEE Systems Journal, 2021, 16(3): 5001−5010

    [13]

    Himeur Y, Alsalemi A, Al-Kababji A, et al. A survey of recommender systems for energy efficiency in buildings: Principles, challenges and prospects[J]. Information Fusion, 2021, 72: 1−21 doi: 10.1016/j.inffus.2021.02.002

    [14]

    Pourpanah F, Etemad A. Exploring the landscape of ubiquitous in-home health monitoring: A comprehensive survey[J]. ACM Transactions on Computing for Healthcare, 2024, 5(4): 1−43

    [15]

    Su Xin, Giancarlo S, Vincenzo M, et al. An edge intelligence empowered recommender system enabling cultural heritage applications[J]. IEEE Transactions on Industrial Informatics, 2019, 15(7): 4266−4275 doi: 10.1109/TII.2019.2908056

    [16]

    Yin Hongzhi, Chen Tong, Qu Liang, et al. On-device recommender systems: A tutorial on the new-generation recommendation paradigm[C]//Proc of the ACM Web Conf 2024. New York: ACM, 2024: 1280−1283

    [17]

    Cai Qiqi, Cao Jian, Xu Guandong, et al. Distributed recommendation systems: Survey and research directions[J]. ACM Transactions on Information Systems, 2024, 43(1): 1−38

    [18]

    Long Jing, Ye Guanhua, Chen Tong, et al. Diffusion-based cloud-edge-device collaborative learning for next POI recommendations[C]//Proc of the 30th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining. New York: ACM, 2024: 1324−1337

    [19]

    Yuan Wei, Qu Liang, Cui Lizhen, et al. Hetefedrec: Federated recommender systems with model heterogeneity[C]//Proc of the 40th Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2024: 2976−2987

    [20]

    Yongbo Yu, Fuxun Yu, Xiang Sheng, et al. Eaglerec: Edge-scale recommendation system acceleration with inter-stage parallelism optimization on GPUs[C]//Proc of the 60th Design Automation Conf. Piscataway, NJ: IEEE, 2023: 1−6

    [21]

    Gong Yu, Jiang Ziwen, Feng Yufei, et al. EdgeRec: Recommender system on edge in mobile Taobao[C]//Proc of the 29th ACM Int Conf on Information & Knowledge Management. New York: ACM, 2020: 2477−2484

    [22]

    Himeur Y, Sohail S, Bensaali F, et al. Latest trends of security and privacy in recommender systems: A comprehensive review and future perspectives[J]. Computers & Security, 2022, 118: 102746

    [23]

    Guo Yeting, Liu Fang, Cai Zhiping, et al. PREFER: Point-of-interest recommendation with efficiency and privacy-preservation via federated edge learning[J]. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2021, 5(1): 1−25

    [24]

    Li Youhuizi, Yu Haitao, Zeng Yan, et al. HFSA: A semi-asynchronous hierarchical federated recommendation system in smart city[J]. IEEE Internet of Things Journal, 2023, 10(21): 18808−18820 doi: 10.1109/JIOT.2023.3281909

    [25]

    Wu Jiang, Yang Yunchao, Hu Miao, et al. FCER: A federated cloud-edge recommendation framework with cluster-based edge selection[J]. IEEE Transactions on Mobile Computing, 2025, 24(3): 1731−1743 doi: 10.1109/TMC.2024.3484493

    [26]

    Cheng H T, Koc L, Harmsen J, et al. Wide & deep learning for recommender systems[C]//Proc of the 1st Workshop on Deep Learning for Recommender Systems. New York: ACM, 2016: 7−10

    [27]

    Guo Huifeng, Tang Ruiming, Ye Yunming, et al. DeepFM: A factorization-machine based neural network for CTR prediction[J]. arXiv preprint, arXiv: 1703.04247, 2017

    [28]

    Jiang Jiazhi, Tian Rui, Du Jiangsu, et al. Mixrec: Orchestrating concurrent recommendation model training on CPU-GPU platform[C]//Proc of the 41st Int Conf on Computer Design. Piscataway, NJ: IEEE, 2023: 366−374

    [29]

    Guo Huifeng, Guo Wei, Gao Yong, et al. ScaleFreeCTR: Mixcache-based distributed training system for CTR models with huge embedding table[C]//Proc of the 44th Int Conf on Research and Development in Information Retrieval. New York: ACM, 2021: 129−1278

    [30]

    Zhao Xiangyu, Wang Maolin, Zhao Xinjian, et al. Embedding in recommender systems: A survey[J]. arXiv preprint arXiv: 2310.18608, 2023.

    [31]

    Zhang Hailin, Liu Zirui, Chen Boxuan, et al. CAFE: Towards compact, adaptive, and fast embedding for large-scale recommendation models[J]. Proceedings of the ACM on Management of Data, 2024, 2(1): 1−28

    [32] 苗旭鹏,张敏旭,邵蓥侠,等. PS-Hybrid:面向大规模推荐模型训练的混合通信框架[J]. 清华大学学报(自然科学版),2022,62(9):1417−1425

    Miao Xupeng, Zhang Minxu, Shao Yingxia, et al. PS-Hybrid: Hybrid communication framework for large recommendation model training[J]. Journal of Tsinghua University (Science and Technology), 2022, 62(9): 1417−1425 (in Chinese)

    [33]

    Zhang Yuanxing, Chen Langshi, Yang Siran, et al. Picasso: Unleashing the potential of GPU-centric training for wide-and-deep recommender systems[C]//Proc of the 38th Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2022: 3453−3466

    [34]

    Acun B, Murphy M, Wang Xiaodong, et al. Understanding training efficiency of deep learning recommendation models at scale[C]//Proc of the 27th IEEE Int Symp on High-Performance Computer Architecture. Piscataway, NJ: IEEE, 2021: 802−814

    [35]

    Song Xiaoniu, Zhang Yiwen, Chen Rong, et al. UGache: A unified GPU cache for embedding-based deep learning[C]//Proc of the 29th Symp on Operating Systems Principles. New York: ACM, 2023: 627−641

    [36]

    Kaggle. Click-through rate prediction[EB/OL]. [2025-02-24]. https://www.kaggle.com/c/avazu-ctr-prediction.

    [37]

    Zeng Chaoliang, Liao Xudong, Cheng Xiaodian, et al. Accelerating neural recommendation training with embedding scheduling[C]//Proc of the 21st USENIX Symp on Networked Systems Design and Implementation. Berkeley, CA: USENIX Association, 2024: 1141−1156

    [38]

    Agarwal S, Yan Chengpo, Zhang Ziyi, et al. Bagpipe: Accelerating deep recommendation model training[C]//Proc of the 29th Symp on Operating Systems Principles. New York: ACM, 2023: 348−363

    [39]

    Kwon Y, Rhu M. Training personalized recommendation systems from GPU scratch: Look forward not backwards[C]//Proc of the 49th Annual Int Symp on Computer Architecture. New York: ACM, 2022: 860−873

    [40]

    Paszke A, Gross S, Massa F, et al. PyTorch: An imperative style, high-performance deep learning library[C]//Proc of the 33rd Int Conf on Neural Information Processing Systems. Red Hook: Curran Associates Inc, 2019: 8026−8037

    [41]

    Ma Kaihao, Yan Xiao, Cai Zhenkun, et al. FEC: Efficient deep recommendation model training with flexible embedding communication[J]. Proceedings of the ACM on Management of Data, 2023, 1(2): 1−21

    [42]

    Ghadimi S, Lan Guanghui, Zhang Hongchao. Mini-batch stochastic approximation methods for nonconvex stochastic composite optimization[J]. Mathematical Programming, 2016, 155(1): 267−305

    [43]

    Park C, Lee J. Mobile edge computing-enabled heterogeneous networks[J]. IEEE Transactions on Wireless Communications, 2020, 20(2): 1038−1051

    [44]

    Li Yun, Ma Hui, Wang Lei, et al. Optimized content caching and user association for edge computing in densely deployed heterogeneous networks[J]. IEEE Transactions on Mobile Computing, 2020, 21(6): 2130−2142

    [45]

    Um T, Oh B, Kweun M, et al. Metis: Fast automatic distributed training on heterogeneous GPUs[C]//Proc of the 2024 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2024: 563−578

    [46]

    Ling Neiwen, Wang Kai, He Yuze, et al. RT-mDL: Supporting real-time mixed deep learning tasks on edge platforms[C]//Proc of the 19th ACM Conf on Embedded Networked Sensor Systems. New York: ACM, 2021: 1−14

    [47]

    Kong Z J, Xu Qiang, Meng Jiayi, et al. AccuMO: Accuracy-centric multitask offloading in edge-assisted mobile augmented reality[C]//Proc of the 29th Annual Int Conf on Mobile Computing and Networking. New York: ACM, 2023: 1−16

    [48]

    Zhao M, Choudhary D, Tyagi D, et al. RecD: Deduplication for end-to-end deep learning recommendation model training infrastructure[J]. arXiv preprint, arXiv: 2211.05239, 2022

    [49]

    Miao Xupeng, Zhang Hailin, Shi Yining, et al. HET: Scaling out huge embedding model training via cache-enabled distributed framework[J]. Proceedings of the VLDB Endowment, 2021, 15(2): 312−320 doi: 10.14778/3489496.3489511

    [50]

    Adnan M, Maboud Y, Mahajan D, et al. Accelerating recommendation system training by leveraging popular choices[J]. Proceedings of the VLDB Endowment, 2021, 15(1): 127−140 doi: 10.14778/3485450.3485462

    [51]

    Wang Chunnan, Wang Hongzhi, Wang Junzhe, et al. Autosr: Automatic sequential recommendation system design[J]. IEEE Transactions on Knowledge and Data Engineering, 2024, 36(11): 5647−5660 doi: 10.1109/TKDE.2024.3400031

    [52]

    Li Jiayu, He Zhiyu, Cui Yumeng, et al. Towards ubiquitous personalized music recommendation with smart bracelets[J]. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2022, 6(3): 1−34

    [53]

    Wang Qinyong, Yin Hongzhi, Chen Tong, et al. Next point-of-interest recommendation on resource-constrained mobile devices[C]//Proc of the ACM Web Conf 2020. New York: ACM, 2020: 906−916

    [54]

    Long Jing, Chen Tong, Nguyen Q, et al. Decentralized collaborative learning framework for next POI recommendation[J]. ACM Transactions on Information Systems, 2023, 41(3): 1−25

    [55]

    Muhammad K, Wang Q, O'Reilly-Morgan D, et al. FedFast: Going beyond average for faster training of federated recommender systems[C]//Proc of the 26th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining. New York: ACM, 2020: 1234−1242

    [56]

    Sun Zehua, Xu Yonghui, Liu Yong, et al. A survey on federated recommendation systems[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 36(1): 6−20

    [57]

    Zhang Chunxu, Long Guodong, Zhou Tianyi, et al. GPFedRec: Graph-guided personalization for federated recommendation[C]//Proc of the 30th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining. New York: ACM, 2024: 4134−4142

    [58]

    Ding Yuchen, Zhang Siqing, Fan Boyu, et al. FedLoCA: Low-rank coordinated adaptation with knowledge decoupling for federated recommendations[C]//Proc of the 18th ACM Conf on Recommender Systems. New York: ACM, 2024: 690−700

    [59]

    Belal Y, Bellet A, Mokhtar S B, et al. PEPPER: Empowering user-centric recommender systems over gossip learning[J]. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, 2022, 6(3): 1−27

    [60]

    Xia S, Wei P, Liu Yanchen, et al. ReCA: A multi-task deep reinforcement learning-based recommender system for co-optimizing energy, comfort and air quality in commercial buildings[C]//Proc of the 10th ACM Int Conf on Systems for Energy-Efficient Buildings, Cities, and Transportation. New York: ACM, 2023: 99−109

    [61]

    Gao Ye, Ma Meiyi, Gordon K, et al. A monitoring, modeling, and interactive recommendation system for in-home caregivers: Demo abstract[C]//Proc of the 18th ACM Conf on Embedded Networked Sensor Systems. New York: ACM, 2020: 587−588

    [62]

    Matam K, Ramezani H, Wang Fan, et al. QuickUpdate: A real-time personalization system for large-scale recommendation models[C]//Proc of the 21st USENIX Symp on Networked Systems Design and Implementation. Berkeley, CA: USENIX Association, 2024: 731−744

    [63]

    Wang Zheng, Wang Yuke, Deng Jiaqi, et al. RAP: Resource-aware automated GPU sharing for multi-GPU recommendation model training and input preprocessing[C]//Proc of the 29th ACM Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2024: 964−979

    [64]

    Yang Chen, Chen Jin, Yu Qian, et al. An incremental update framework for online recommenders with data-driven prior[C]//Proc of the 32nd ACM Int Conf on Information & Knowledge Management. New York: ACM, 2023: 4894−4900

    [65]

    Sima C, Fu Y, Sit M K, et al. Ekko: A large-scale deep learning recommender system with low-latency model update[C]//Proc of the 16th USENIX Symp on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2022: 821−839

    [66]

    Yu Keping, Guo Zhiwei, Shen Yu, et al. Secure artificial intelligence of things for implicit group recommendations[J]. IEEE Internet of Things Journal, 2021, 9(4): 2698−2707

    [67]

    Deng Yongheng, Wang Guanbo, Yu Sheng, et al. Relayrec: Empowering privacy-preserving CTR prediction via cloud-device relay learning[C]//Proc of the 23rd ACM/IEEE Int Conf on Information Processing in Sensor Networks. Piscataway, NJ: IEEE, 2024: 188−199

    [68]

    Wang Ruoxi, Fu Bin, Fu Gang, et al. Deep & cross network for ad click predictions[C]//Proc of the 23rd ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2017: 1−7

    [69]

    Chen Wenqiang, Zhan Lizhang, Ci Yuanlong, et al. FLEN: Leveraging field for scalable CTR prediction[J]. arXiv preprint, arXiv: 1911.04690, 2019

    [70]

    Adnan M, Maboud Y E, Mahajan D, et al. Heterogeneous acceleration pipeline for recommendation system training[C]//Proc of the 51st Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2024: 1063−1079

    [71] 贺巩山,赵传磊,蒋金虎,等. 面向深度学习的数据存储技术综述[J/OL]. 计算机学报,2025. https://kns.cnki.net/kcms2/article/abstract?v=m17bUIR54SPhGmKv-wDT7IzxL9MQtlh87t6Zyxle_sH9wEjmdaSJSDnuYzFHGXtma3RHqTZAEE22Kg1stg272e4gDnqv4O166FZLuN3o2uhZtLJeU7sNmZoI8RlN4muKWBrqSj9uiSCwrA_RrJgX54JIKQ9-_AYicdqBtqjNZ8pz2iCCWvbC1Q==&uniplatform=NZKPT&language=CHS

    He Gongshan, Zhao Chuanlei, Jiang Jinhu, et al. A survey of data storage technologies for deep learning[J/OL]. Chinese Journal of Computers, 2025 (in Chinese). https://kns.cnki.net/kcms2/article/abstract?v=m17bUIR54SPhGmKv-wDT7IzxL9MQtlh87t6Zyxle_sH9wEjmdaSJSDnuYzFHGXtma3RHqTZAEE22Kg1stg272e4gDnqv4O166FZLuN3o2uhZtLJeU7sNmZoI8RlN4muKWBrqSj9uiSCwrA_RrJgX54JIKQ9-_AYicdqBtqjNZ8pz2iCCWvbC1Q==&uniplatform=NZKPT&language=CHS

    [72]

    Xie Minhui, Lu Youyou, Wang Qing, et al. PetPS: Supporting huge embedding models with persistent memory[J]. Proceedings of the VLDB Endowment, 2023, 16(5): 1013−1022 doi: 10.14778/3579075.3579077

    [73]

    Wei Yingcan, Langer M, Yu Fan, et al. A GPU-specialized inference parameter server for large-scale deep recommendation models[C]//Proc of the 16th ACM Conf on Recommender Systems. New York: ACM, 2022

    [74]

    Goyal P, Dollár P, Girshick R, et al. Accurate, large minibatch SGD: Training ImageNet in one hour[J]. arXiv preprint, arXiv: 1706.02677, 2017

    [75]

    Kuhn H. The Hungarian method for the assignment problem[J]. Naval Research Logistics Quarterly, 1955, 2(1/2): 83−97 doi: 10.1002/nav.3800020109

    [76]

    Lopes P, Yadav S, Ilic A, et al. Fast block distributed CUDA implementation of the Hungarian algorithm[J]. Journal of Parallel and Distributed Computing, 2019, 130: 50−62 doi: 10.1016/j.jpdc.2019.03.014

    [77]

    Lawler E. Combinatorial Optimization: Networks and Matroids[M]. New York: Holt, Rinehart and Winston, 2001

    [78]

    Munkres J. Algorithms for the assignment and transportation problems[J]. Journal of the Society for Industrial and Applied Mathematics, 1957, 5(1): 32−38 doi: 10.1137/0105003

    [79]

    Kaggle. Display advertising challenge[EB/OL]. [2025-02-23]. https://www.kaggle.com/c/criteo-display-ad-challenge

    [80]

    Tallis M, Yadav P. Reacting to variations in product demand: An application for conversion rate (CR) prediction in sponsored search[J]. arXiv preprint, arXiv: 1806.08211, 2018

    [81]

    Delestrac P, Bhattacharjee D, Yang Simei, et al. Multi-level analysis of GPU utilization in ML training workloads[C]//Proc of 2024 Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2024: 1−6

    [82]

    Shubha S, Shen Haiying, Iyer A. Usher: Holistic interference avoidance for resource optimized ML inference[C]//Proc of the 18th USENIX Symp on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2024: 947−964

    [83]

    Yuan Wei, Yang Chaoqun, Qu Liang, et al. Hide your model: A parameter transmission-free federated recommender system[C]//Proc of the 40th Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2024: 611−624

    [84]

    Zhang Ye, Deng Yongheng, Yue Sheng, et al. DualRec: A collaborative training framework for device and cloud recommendation models[J]. IEEE Transactions on Mobile Computing, 2025. https://ieeexplore.ieee.org/abstract/document/10840283

    [85]

    Lian Xiangru, Yuan Binhang, Zhu Xuefeng, et al. Persia: An open, hybrid system scaling deep learning-based recommenders up to 100 trillion parameters[C]//Proc of the 28th ACM SIGKDD Int Conf on Knowledge Discovery & Data Mining. New York: ACM, 2022: 3288−3298

    [86]

    Lai Fan, Zhang Wei, Liu Rui, et al. AdaEmbed: Adaptive embedding for large-scale recommendation models[C]//Proc of the 17th USENIX Symp on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2023: 817−831

    [87]

    Zhao Xiangyu, Liu Haochen, Fan Wenqi, et al. AutoEmb: Automated embedding dimensionality search in streaming recommendations[C]//Proc of the 21st Int Conf on Data Mining. Piscataway, NJ: IEEE, 2021: 896−905

    [88]

    Luo Qinyi, Wang Penghan, Zhang Wei, et al. Fine-grained embedding dimension optimization during training for recommender systems[J]. arXiv preprint, arXiv: 2401.04408, 2024

    [89]

    Bahreini T, Badri H, Grosu D. Mechanisms for resource allocation and pricing in mobile edge computing systems[J]. IEEE Transactions on Parallel and Distributed Systems, 2021, 33(3): 667−682

    [90]

    He Ying, Fang Jingcheng, Yu F R, et al. Large language models (LLMs) inference offloading and resource allocation in cloud-edge computing: An active inference approach[J]. IEEE Transactions on Mobile Computing, 2024, 23(12): 11253−11264

    [91]

    Tan Haisheng, Wang Yi, Zhang Chi, et al. Asymptotically tight approximation for online file caching with delayed hits and bypassing[J]. IEEE Transactions on Networking, 2025. https://ieeexplore.ieee.org/abstract/document/10936289

Figures (11)  /  Tables (3)
Publication history
  • Received: 2025-02-28
  • Revised: 2025-04-06
  • Published online: 2025-04-14
