Survey on Key Technologies of Graph Processing Systems Based on Multi-core CPU and GPU Platforms
-
摘要:
图计算作为分析与挖掘关联关系的一种关键技术,已在智慧医疗、社交网络分析、金融反欺诈、地图道路规划、计算科学等领域广泛应用. 当前,通用CPU与GPU架构的并行结构、访存结构、互连结构及同步机制的不断发展,使得多核CPU与GPU成为图处理加速的常用平台. 但由于图处理具有处理数据规模大、数据依赖复杂、访存计算比高等特性,加之现实应用场景下的图数据分布不规则且图中的顶点与边呈现动态变化,给图处理的性能提升和高可扩展性带来严峻挑战. 为应对上述挑战,大量基于多核CPU与GPU平台的图处理系统被提出,并在该领域取得显著成果. 为了让读者了解多核CPU与GPU平台上图处理优化相关技术的演化,首先剖析了图数据、图算法、图应用特性,并阐明图处理所面临的挑战. 然后分类梳理了当前已有的基于多核CPU与GPU平台的图处理系统,并从加速图处理设计的角度,详细、系统地总结了关键优化技术,包括图数据预处理、访存优化、计算加速和数据通信优化等. 最后对已有先进图处理系统的性能、可扩展性等进行分析,并从不同角度对图处理未来发展趋势进行展望,希望对从事图处理系统研究的学者有一定的启发.
Abstract: As a key technique for analyzing and mining relationships, graph computing has been widely used in smart healthcare, social network analysis, financial anti-fraud, road navigation, computational science, and other fields. In recent years, the continuous development of the parallel structures, memory access structures, interconnection structures, and synchronization mechanisms of general-purpose CPU and GPU architectures has made multi-core CPUs and GPUs common platforms for accelerating graph processing. However, graph processing faces severe challenges in improving performance and achieving high scalability, owing to the large scale and irregular distribution of graph data, complex data dependences, a high ratio of memory accesses to computation, and the dynamic changes of vertices and edges in real-world applications. To address these challenges, a large number of graph processing systems based on multi-core CPU and GPU platforms have been proposed and have achieved remarkable results. To provide readers with a comprehensive understanding of the evolution of graph processing optimizations on multi-core CPU and GPU platforms, we first analyze the characteristics of graph data, graph algorithms, and graph applications, and clarify the challenges of graph processing. We then categorize the existing graph processing systems based on multi-core CPU and GPU platforms. From the perspective of accelerating graph processing, we systematically summarize the key optimization techniques in detail, including graph data pre-processing, memory access optimization, computation acceleration, and data communication optimization. Finally, we analyze the performance and scalability of state-of-the-art graph processing systems and discuss future development trends of graph processing from different perspectives, hoping to inspire researchers who work on high-performance graph processing systems.
-
自Vaswani等人[1]提出Transformer架构以来,其核心组件——自注意力(self-attention)机制凭借动态长程依赖建模能力,成功突破了传统卷积神经网络(CNN)在感受野上的限制. 这一突破催生了视觉Transformer(vision Transformer,ViT)模型[2]及其变体,并在计算机视觉领域引发范式变革. DeiT[3],Swin Transformer[4]等模型在图像分类任务上超越ResNet系列[5],DETR[6]重构了目标检测范式,MaskFormer[7]推动了图像语义分割的精度突破,ViT-GAN[8]在图像合成任务中展现出细粒度控制能力,DiT[9]则进一步拓展了图像生成任务的应用广度. 然而,其出色的性能背后是高昂的计算开销和内存占用:ViT-L模型参数量达307×10^6,单张224×224图像推理需190.7 GFLOPs[10],这与其在资源受限的边缘设备[11]上的部署需求形成显著矛盾.
为缓解ViT的硬件部署压力,模型压缩成为必要手段,其中模型量化(model quantization)是核心技术之一. 模型量化将32位浮点精度(FP32)的权重或激活值映射至低位宽(如8/4 bit)整数空间表示,有效降低模型大小;同时利用低位宽定点运算代替高精度浮点运算,以加速模型推理. 训练后量化(post-training quantization,PTQ)[10-15]是一类高效实用的模型量化方法. PTQ方法基于统计先验进行轻量化校准,仅需少量无标签校准数据甚至无需校准数据即可完成模型量化,其免训练特性能够适配边缘设备的实时部署需求.
多种PTQ方法已被用于ViT量化[10-15]. 这些方法针对ViT中特定层的输出激活,采用复杂高阶函数[13,16]、Hessian矩阵指导的度量方法[12]等进行量化指导和补偿,在8/6比特量化中取得了成效. 然而,将现有PTQ方法应用于低比特(如4 bit)量化时,仍面临不可忽视的挑战:
1)低位宽表示引起模型性能衰减. ViT中激活值的高维特征空间存在显著的长尾分布现象;同时,将大量数据聚合到少量量化点位上的不恰当拟合所导致的量化损失,也无法通过反量化得到补偿.
2)量化策略与计算特性的错位匹配. 在ViT架构(如ViT-Base)中,FFN,QKV Gen,Proj等线性层占据了80%以上的计算开销,其计算密度是LayerNorm的98倍、Softmax的294倍、GELU的45倍[18]. 然而,现有研究更多地强调非线性层的特殊分布和激活量化,忽视了ViT自身的计算特性和资源利用特性,缺乏量化算法与计算加速相结合的全局性视角.
3)硬件适配性缺失. 现有PTQ量化算法与底层硬件架构的协同设计存在显著差距. 尽管多数方法宣称硬件友好性,但出于模型精度考虑,量化数据需要在计算前反量化到原始精度后再进行计算,并未充分利用硬件支持的低位宽计算以提升计算性能.
针对上述3方面的挑战,本文从“量化-计算联合优化”出发,针对ViT低位宽量化场景展开系统性研究. 通过分析量化敏感度并借助z-score方法,揭示了低比特量化误差与计算效率间的耦合作用机制,并发现其与异常激活特征及概率分布特性的内在关联规律:
1)量化-计算的非一致性. 量化导致性能衰减严重的部分,并不意味着同时也是高计算/存储开销的部分. 这表明即使对该部分设计了精细的量化算法,量化带来的计算收益和存储收益也可能相当有限.
2)量化敏感度高的层均表现出显著的离群值(outlier)现象;同时,实验中还观察到离群值的普遍存在. 这表明低位宽量化的关键在于高动态范围的离群值表征与低位宽约束间的本质冲突,而非单纯源于数据的长尾分布或高斯分布本身.
3)数值分布中存在的普适约束性. 尽管ViT激活分布中呈现明显的非高斯特性(长尾分布),其数值范围仍然保持了高度集中性,即至少97%的数据呈现出聚集现象.
基于上述发现,本文提出面向ViT的PTQ自适应分治量化(divide-and-conquer and adaptive quantization,DAQ)方法. DAQ在保持硬件兼容性的前提下实现4-bit量化的性能突破,主要贡献有3点.
1)针对量化-计算/存储开销联合分析,揭示ViT系列模型中的2个关键规律:一是计算密集型层与量化敏感层存在显著空间错位(如Softmax层量化可带来80%的性能下降却仅占8%的计算负载);二是借助z-score方法发现非高斯分布中隐藏数值集中特性(97%激活值分布仍呈类高斯分布的聚集性),突破传统高斯假设局限.
2)基于贡献1)提出了针对ViT的4-bit PTQ量化方法DAQ. DAQ融合分治策略与自适应参数调节:分治策略基于数据聚集性,利用动态感知的z-score方法将数据划分为正常值与离群值;自适应参数动态调节划分范围,以获得更高的量化性能. 基于量化-计算瓶颈分析的结论,DAQ从层归一化(layer normalization,LN)和GELU输出激活量化出发,形成了普适性的量化方法. DAQ在不同模型、不同量化对象的实验中均表现出SOTA性能,获得了最大达4.37%的精度提升,目标检测任务上4-bit量化模型的平均精度损失低至0.4%;在DeiT-T和Swin-S上,量化模型的精度甚至超越了全精度模型.
3)DAQ充分考虑量化-计算协同设计,提出计算友好的均匀关联量化方法,并依靠GPU Tensor Core等加速器硬件内核,以低位宽定点计算代替高位宽浮点运算. 实验显示,DAQ未引起明显的性能降低,同时对线性计算有2%的计算加速.
1. 相关工作
PTQ作为深度学习模型压缩领域的主流范式,因其无需微调的特性而具备显著的操作便捷性;但缺乏训练阶段的误差补偿机制,也导致PTQ方法在低位宽(如4 bit)动态激活值量化场景中面临严峻的精度保持挑战. 针对ViT架构的激活量化研究,现有方法可根据量化对象划分为以下3类:Softmax输出激活(post-softmax activation)量化、GELU输出激活(post-GELU activation)量化和LayerNorm输出激活(post-LayerNorm activation)量化.
针对Softmax输出激活,FQ-ViT[10]采用非均匀的log2量化方法,为其中的高频小值分配更多更精细的量化点以获得更小的量化误差. PTQ4ViT[11]引入的孪生均匀量化(twin uniform quantization)方法将Softmax输出激活分成2个独立的量化区间以实现更为细致的量化过程,两个区间由2个并不相互独立的缩放因子控制,二者之间具有 2^k 倍的关联性. APQ-ViT[13]利用Softmax函数的马太效应(Matthew effect),实现了一种非对称线性量化方法以改善量化误差的分布. RepQ-ViT[14],RepQuant[15]先采用log√2量化以更好地拟合具有幂律分布的注意力分数,随后对缩放因子重参数化使其基数为2,从而在推理中以快捷的移位计算代替反量化. TSPTQ-ViT[19]利用非正态分布值中的位稀疏性,为Softmax输出激活的不同区域分配不同的比例因子以获得更小的量化误差.
对于GELU输出激活,PTQ4ViT[11]仍然采用与Softmax输出激活一致的孪生均匀量化方法,只是区间划分改为由符号决定. TSPTQ-ViT[17]也采用了类似的双区域策略,并针对GELU输出激活值中的符号位做了适应性处理.
对于LayerNorm输出激活,FQ-ViT[10]提出PTF(power-of-two factor)量化方法. 作为逐通道量化方法的一种变形,PTF按通道尺度分配量化调节因子α,缩放因子取 2^α s 而非s. RepQ-ViT[14]对LayerNorm输出采用逐通道量化后,再调整量化参数缩放因子s和零点偏移z,同时修改LayerNorm对应的权重参数. RepQuant[15]扩展了逐通道双边策略以实现更准确的量化,最大限度减少量化空间内的偏差. TSPTQ-ViT[17]则利用K-means算法筛选离群值后,选择一组线性正常值并基于Hessian指导的方法求解缩放因子,该过程可与权重的缩放因子联合优化以获得更佳性能.
当前基于PTQ的ViT量化研究存在3方面的局限:1)低位宽(4 bit)量化性能下降严重;2)量化算法泛化性不足,现有方法普遍针对特定分布和特定层设计,过分精细的针对性设计导致方法的扩展性和适应性不佳;3)硬件部署低效,尽管多数方法宣称硬件友好,但量化/反量化开销实际上降低了计算效率,同时量化方法也未充分利用新一代AI加速器的特性,缺乏直接利用量化数据加速计算的相关实践.
2. 预备知识
2.1 ViT
ViT[2]是基于Transformer编码器结构的图像分类模型. ViT将原本CNN中的卷积结构用Transformer替换,实现了Transformer架构从自然语言处理(natural language processing,NLP)领域到计算机视觉(computer vision,CV)领域的迁移. 为了进一步提高性能,ViT 的变体如 DeiT[3] 和 Swin [4]相继被提出,并在图像分类[1,18-19]、目标检测[6,20-21]、语义分割[22-23]、图文生成[24]等计算机视觉任务中取得了巨大的成功.
图1展示了ViT模型的组成结构. ViT由图像块嵌入(patch embedding)和L个Transformer编码器模块(block)构成. 具体而言,ViT首先将输入大小为H(高度)×W(宽度)×C(通道数)的图像分割为N个大小为P×P的非重叠图像块(patches),经过线性嵌入将每个图像块展平为一维向量后线性投影到d维嵌入空间中,得到图像输入张量 \boldsymbol{X}\in {\mathbb{R}}^{N\times d} 并参与到一系列Transformer编码器模块的计算中. 每个编码器模块由一个多头自注意力(multi-head self-attention,MSA)子模块和一个前馈神经网络(feed-forward neural network,FFN)子模块组成,每个子模块均配有LN和残差连接.
编码器模块的计算过程可以用公式统一描述为:
{\boldsymbol Y}_{l}=MSA\left(LN\left({\boldsymbol X}_{l-1}\right)\right)+{\boldsymbol X}_{l-1}\text{,} (1)

{\boldsymbol X}_{l}=FFN\left(LN\left({\boldsymbol Y}_{l}\right)\right)+{\boldsymbol Y}_{l}\text{,} (2)

其中 l=1,2,…,L 表示编码器模块的编号索引. 如图1所示,MSA子模块通过多头自注意力机制学习不同图像块间的关联性:
\left[{\boldsymbol{Q}}_{i},{\boldsymbol{K}}_{i},{\boldsymbol{V}}_{i}\right]=QKVGen\left({\boldsymbol{X}'}\right)={\boldsymbol{X}'}{\boldsymbol{W}}^{QKV}+{\boldsymbol{b}}^{QKV}\text{,} (3)

Attn_{i}=Softmax\left({\boldsymbol{Q}}_{i}{\boldsymbol{K}}_{i}^{\rm T}/\sqrt{{d}_{h}}\right){\boldsymbol{V}}_{i}\text{,} (4)

MSA\left({\boldsymbol{X}'}\right)=Proj\left(Attn_{1},…,Attn_{h}\right)=Attn\,{\boldsymbol{W}}^{p}+{\boldsymbol{b}}^{p}\text{,} (5)

其中 i=1,…,h ,h表示自注意力机制中的头数, {\boldsymbol{W}}^{QKV}\in {\mathbb{R}}^{d\times 3{d}_{h}},{\boldsymbol{b}}^{QKV}\in {\mathbb{R}}^{3{d}_{h}},{\boldsymbol{W}}^{p}\in {\mathbb{R}}^{h{d}_{h}\times d},{\boldsymbol{b}}^{p}\in {\mathbb{R}}^{d} . QKV Gen基于输入激活 \boldsymbol{X}' 计算Q,K,V矩阵,Proj投影层通过线性计算得到MSA的输出结果. Transformer结构的二次复杂度来源于自注意力机制中的Softmax过程: \boldsymbol{Q}{\boldsymbol{K}}^{\rm{T}} 的计算复杂度为 O({N}^{2}d) ,其中 \boldsymbol{Q},\boldsymbol{K}\in {\mathbb{R}}^{N\times d} ,因而自注意力机制的计算复杂度随输入维度N的增长而二次增加. 在大语言模型(large language model,LLM)中,输入长度和上下文长度一般远超过隐藏维度d,大量计算开销会集中于自注意力机制. 但ViT与LLM有一个显著区别:当输入图像分辨率不变时,ViT划分输入图像为子图时采用的划分数量N一般也固定不变. 因此,ViT模型的计算瓶颈与LLM模型的计算瓶颈并不完全一致.
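为直观展示式(3)~(5)的计算流程以及 O(N^2 d) 复杂度的来源,下面给出一个基于NumPy的MSA前向计算示意. 其中将多头的 W^QKV 合并为 d×3d 的矩阵,矩阵规模取ViT-S的典型配置(N=197,d=384,h=6);随机初始化的权重仅用于演示维度关系,并非真实模型参数:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def msa_forward(X, W_qkv, b_qkv, W_p, b_p, h):
    # X: (N, d);W_qkv: (d, 3d);W_p: (d, d)
    N, d = X.shape
    d_h = d // h                                         # 单个注意力头的维度
    Q, K, V = np.split(X @ W_qkv + b_qkv, 3, axis=-1)    # 式(3): QKV Gen
    heads = []
    for i in range(h):
        s = slice(i * d_h, (i + 1) * d_h)
        # 式(4): N×N 注意力矩阵,单头复杂度 O(N^2 d_h),h 个头合计 O(N^2 d)
        A = softmax(Q[:, s] @ K[:, s].T / np.sqrt(d_h))
        heads.append(A @ V[:, s])
    attn = np.concatenate(heads, axis=-1)
    return attn @ W_p + b_p                              # 式(5): Proj 投影

N, d, h = 197, 384, 6                # ViT-S: 196 个 patch + 1 个 class token
rng = np.random.default_rng(0)
X = rng.standard_normal((N, d)).astype(np.float32)
out = msa_forward(X,
                  0.02 * rng.standard_normal((d, 3 * d)).astype(np.float32),
                  np.zeros(3 * d, dtype=np.float32),
                  0.02 * rng.standard_normal((d, d)).astype(np.float32),
                  np.zeros(d, dtype=np.float32), h)
print(out.shape)                     # (197, 384)
```

当输入分辨率固定时,N固定不变,上述二次项的规模也随之固定,这正是正文所述ViT与LLM计算瓶颈不同的原因.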
FFN模块将特征投影到更高维度来学习特征的不同表达. 设FFN的输入为 \boldsymbol{Y}' ,FFN的计算过程为:
FFN\left({\boldsymbol Y'}\right)=GELU\left({\boldsymbol Y'}{\boldsymbol W}^{\rm FC1}+{\boldsymbol b}^{\rm FC1}\right){\boldsymbol W}^{\rm FC2}+{\boldsymbol b}^{\rm FC2}. (6)

2.2 量 化
模型量化(model quantization)是模型压缩领域的关键技术之一. 其核心目标是将高位宽浮点数表示的权重(weights)和/或激活值(activations)用较低位宽定点数表示,从而显著减少内存占用和带宽需求. 同时,通过用低位定点运算代替高位浮点运算,有望提升计算吞吐率并降低能耗,加速模型推理过程.
2.2.1 量化对象
根据需要量化的目标张量,量化可以分为权重量化和权重-激活量化2类.
权重量化仅量化模型静态权重,减少模型参数,量化过程通常不需要或者只需要小部分校准数据集参与. 但仅量化权重通常无法获得计算性能的提升,反而由于需要在推理过程中保持计算精度而增加反量化开销.
权重-激活量化则是同时量化静态权重和动态激活. 激活值由模型推理过程动态产生,因而需要校正数据集来辅助预测激活值的范围并设计相应的量化方法. 权重-激活量化能带来存储和计算2个维度上的优势.
DAQ主要针对激活量化,但可以与其他权重量化方法结合,同时实现权重-激活量化.
2.2.2 量化粒度
根据共享量化参数的范围,量化粒度由粗到细可以分为逐层量化、逐组量化和逐通道(channel)/Token量化3类.
逐层量化以一个层的输出激活值或整层的权重为单位,特征张量共用一组缩放因子(scale factor)s和零点(zero point)偏移z .
逐组量化是对目标张量进行分组,以组(group)为单位,组内共享缩放因子s和零点偏移z,组间的量化参数可以不同. 当组数为1时,逐组量化与逐层量化等价;当以通道或者Token为分组单位时,逐组量化就是逐通道/Token量化.
量化粒度越细,量化参数包含的补偿信息越多,量化误差越小;但量化过程也越复杂,量化和量化后相关计算的效率越低. 逐层量化因参数简单而硬件友好度最高,因此本文中DAQ采用粗粒度的逐层量化.
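下面用一段NumPy示意对比逐层量化与逐通道量化在量化参数数量上的差异(假设激活张量形状为(Token数, 通道数),并用Min-Max方式估计参数,仅作粒度概念演示):

```python
import numpy as np

def minmax_params(x, bits=8):
    # 由极值估计非对称均匀量化所需的缩放因子 s 与零点偏移 z(示意性定义)
    lo, hi = float(x.min()), float(x.max())
    s = (hi - lo) / (2 ** bits - 1)
    return s, lo / s                     # 零点偏移以 s 为尺度表示

X = np.random.default_rng(0).standard_normal((197, 384))

s_layer = minmax_params(X)                                       # 逐层: 全张量共享 1 组参数
s_channel = [minmax_params(X[:, c]) for c in range(X.shape[1])]  # 逐通道: 每通道 1 组参数
print(1, len(s_channel))                 # 1 组 vs 384 组: 粒度越细,补偿信息越多
```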
2.2.3 均匀量化和非均匀量化
根据量化方法中的取值域是否等间隔,量化可以分为均匀量化(uniform quantization)和非均匀量化(non-uniform quantization).
均匀量化因其数学形式简洁且硬件友好而成为工业界应用最广泛的量化方案. 其核心思想是将实数域映射到等间隔离散量化电平上,数学形式定义为:
Q\left(x;b,s,z\right)=clip\left(\left\lceil \frac{x-z}{s}\right\rfloor ,0,{2}^{b}-1\right)\text{,} (7)

其中 x\in \mathbb{R} 为待量化数据, b\in {\mathbb{Z}}^{+} 表示量化位宽, z\in \mathbb{R} 为零点偏移, s\in {\mathbb{R}}^{+} 为缩放因子; \lceil \cdot \rfloor 表示近似取整的数学函数,可以是四舍五入、向上/向下取整等;clip为截断函数:
clip\left(x;a,b\right)=\left\{\begin{aligned} & a,\;\;{\rm{if}}\;x < a\text{,}\\ & x,\;\;{\rm{if}}\;a\le x\le b\text{,}\\ & b,\;\;{\rm{if}}\;x > b.\end{aligned}\right. (8)

均匀量化的反量化过程为:
DeQ\left(\hat{x};s,z\right)=\left(\hat{x}+z\right)s .

均匀量化的参数计算高效,仅需确定z与s即可完成全域映射,时间复杂度为O(1);等间隔特性也使利用量化数据直接参与计算更加简便. 但另一方面,单一均匀量化对长尾分布数据的适应性较差,易导致尾部区域的量化误差累积.
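式(7)、式(8)与上述反量化过程可直接写成如下NumPy示意(取整采用四舍五入;演示取 z=0 的对称情形,并将数据先平移到非负区间,数据为随机生成的近似高斯张量,仅用于说明均匀量化的往返误差):

```python
import numpy as np

def quantize(x, b, s, z):
    # 式(7): Q(x; b, s, z) = clip(⌈(x − z)/s⌋, 0, 2^b − 1),⌈·⌋ 取四舍五入
    return np.clip(np.round((x - z) / s), 0, 2 ** b - 1)

def dequantize(x_hat, s, z):
    # 文中的反量化定义: DeQ(x̂; s, z) = (x̂ + z)·s
    return (x_hat + z) * s

rng = np.random.default_rng(0)
x = rng.standard_normal(10000)
b = 4
s = (x.max() - x.min()) / (2 ** b - 1)   # 由极差确定缩放因子
q = quantize(x - x.min(), b, s, 0)       # 平移到非负区间后做 4 bit 均匀量化
x_rec = dequantize(q, s, 0) + x.min()
print("均方量化误差:", np.mean((x - x_rec) ** 2))
```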
非均匀量化作为处理非对称数据分布的关键技术之一,对具有显著长尾特征的数据建模更为精确. 研究者提出了多种非线性量化函数如指数量化[10]、对数量化[16]等. 非均匀量化的难点在于计算复杂度高以及难以找到合适的非均匀量化映射函数.
2.2.4 量化与模型训练
基于量化过程与模型训练的耦合程度,现有方法可分为量化感知训练 (quantization-aware training,QAT)和训练后量化 (post-training quantization,PTQ).
QAT通过将量化噪声注入前向传播过程,联合优化模型参数与量化器超参数(如缩放因子s、零点偏移z). 尽管QAT能通过端到端微调获得更高的量化模型性能,但其重训练过程计算开销高昂;另一方面,QAT依赖大规模数据进行微调,微调过程本身也可能导致模型无法收敛,因而不适用于小样本数据集或样本获取困难的场景.
PTQ则提供了一种无需训练(或仅需极小校准成本)的方法,以实现快速有效的量化. 同时,PTQ可直接生成符合INT8/FP16等工业标准的量化模型,硬件兼容性更高. 低计算成本、易部署的特性使得PTQ方法在资源受限的边缘计算场景中展现出更优的工程可行性.
本文研究专注于PTQ低位宽(4 bit)的量化实现.
3. 模型分析
ViT低位宽量化的核心挑战在于动态激活值的高效量化. 为了充分利用ViTs中的数据特性,本节将分析ViT激活量化瓶颈及量化计算效率,挖掘低位宽量化引起性能衰减的原因,探索量化有效目标,指导低位宽量化方法的设计.
3.1 低位宽量化瓶颈分析
为了定位ViT模型低位宽量化的瓶颈所在,本节基于ViT-S架构开展分层量化敏感性分析:采用Min-Max均匀量化算法[16],对LayerNorm,Softmax,FFN,GELU等不同功能层的输出激活实施6/5/4比特低位宽量化,评估单一模块量化对模型性能的影响(即ImageNet数据集上模型Top-1精度的变化);量化当前模块的输出等价于量化下一模块的输入. 测试结果如图2所示. 为便于分析,图2中各模块按量化后精度由高到低排序,而非按照模块中的计算顺序LN,QKV Gen,Softmax,Proj,FC1,GELU,FC2排序.
图2表明,不同层对低位宽量化误差的敏感度不同,从而导致模型实际性能下降有所区别. MSA模块中QKV Gen和Proj线性层的输出激活,随量化位宽下降而引起的模型精度衰减并不明显;但当量化位宽低于某个阈值(如5 bit)时,LN和FC1的量化误差传递给下一层后,都出现了较为明显的精度下降(达25%~35%);而GELU,Softmax和FC2层的输出激活量化与模型的性能损失呈强正相关性,即随着量化位宽逐渐减小,模型精度迅速衰减. 当采用4 bit量化FC2输出时,量化误差造成模型性能崩溃.
为了进一步研究部分层输出激活明显受量化位宽影响的原因,图3分析了其输出激活的分布状态. 其中,LN2(第2个Layer Norm的输出),FC1和FC2的输出激活都服从或近似服从正态分布,但明显存在高动态范围的离群值;非线性层GELU和 Softmax的输出呈现显著的长尾特性,其离群值亦导致特征张量的动态范围呈指数级扩展.
因此,实验表明量化性能下降的核心矛盾在于异常值表征与低位宽约束间的本质冲突,而非单纯源于数据的长尾分布或高斯分布本身.
图4将离群值研究扩展到其他ViT系列模型,统计了ViT及其变体模型推理过程中所有激活张量的极大值与极小值. 图4显示,离群值现象在ViT系列模型中保持跨模型的一致性和显著性. 这与LLM中离群值仅在模型规模较大(≥7B参数)时才被观察到的现象[25]不同:ViT模型即使在微型架构(如ViT-Tiny,5×10^6参数)上也表现出显著的激活离群值现象. 而且,Llama 7B[26]中的离群值变化幅度(9.7)[15]远低于ViT( {10}^{2}\sim {10}^{3} ).
跨模型尺度的离群值泛在现象进一步表明,传统均匀量化方法下,低位宽(4 bit及以下)的有限表征能力将导致显著的离群值信息损失,进而损害模型性能.
为量化分析ViT中的离群值,本文利用统计学方法中的标准分数z-score来评估数据的偏离程度以判定离群值. z-score的计算方法为:
Z\left(x\right)=\frac{x-\mu }{\sigma }\text{,} (9)

其中 \mu ,\sigma 分别为原始数据的均值和标准差. 以z-score的绝对值作为离群值的判断指标,并参考概率统计学意义上的高z-score,图5展示了ViT-S/DeiT-B/Swin-S推理过程中每个层激活的高z-score(>3或>6)占比情况,离群值占比最高约为3%.
图5也从另一方面揭示了ViTs中数值分布的集中性规律,即97%以上的激活值在标准化空间内呈现紧密聚集特性,且这种聚集性并不因激活数值分布的不同而改变.
ViT量化中离群值的低位宽表征难题以及数据聚集性,启发本文建立离群值驱动的分段量化:基于激活值呈现的“集中正常值+稀疏离群值”特性,DAQ采用分治量化策略,对正常值和离群值分别采用均匀量化、但相互关联的量化参数,既避免全局量化带来的模型性能下降,又能保持整体计算效率.
3.2 量化-计算分析
量化的另一个潜在优势是利用低位宽定点计算替代高位宽浮点运算从而获得计算加速. 首先需要识别计算密集型关键运算层,从而利用低位宽量化加速.
本文在NVIDIA RTX 3090[27]硬件平台上,从ImageNet-1K测试集[28]中随机选取100张图片,在ViT-S模型上进行前向推理,测试各计算层的平均时延占比、参数开销占比以及4-bit量化下的模型精度损失百分比,结果如表1所示.
表 1 ViT-S各层精度下降/存储/计算占比分析
Table 1. Percentage Analysis of Accuracy Reduction/Storage/Computation Proportion of ViT-S Layers

层	计算平均时延占比/%	参数+激活存储占比/%	量化精度下降/%
LN	6	5	32
QKV Gen	21	17	4
Softmax	2	8	84
Proj	9	8	9
FC1	27	22	40
GELU	4	10	65
FC2	31	30	98

注:最大值用黑体数值表示,最小值用下划线数值表示.

表1从计算、存储、量化精度3方面研究了量化-计算开销错位的情形. 实验结果表明,仅关注数据的特殊分布而设计的量化方法,在量化效能上可能是次优甚至低效的;量化方法需要与计算实践相结合,才能有效获得存储/计算性能上的提升.
如表1所示,尽管Softmax输出激活对量化位宽高度敏感(模型精度衰减达84%),但其关联层Proj的平均计算开销和存储开销分别仅占9%和8%. 这说明即使采用细粒度量化Softmax输出激活以保持模型精度,其对模型压缩和推理加速的贡献收益仍然十分有限.
相较而言,LayerNorm与GELU的激活输出作为QKV Gen和全连接层(FC1/FC2)的核心输入,在存储占用和计算加速2个维度均呈现显著的优化潜力(二者相关的计算时延和存储占比分别高达87%和75%). 其他层(如QKV Gen和Proj)输出激活的量化对模型性能和计算能效的提升并不显著,模型精度保持也较好.
根据表1的综合分析,本文将针对LN和GELU的输出激活作为关键量化对象,充分考虑量化方法本身的复杂性和量化后数据的硬件友好性,提出了基于z-score方法的关联量化方法DAQ.
值得说明的是,虽然本文研究的量化算法关注LN和GELU的输出激活,但DAQ本质上提供了一种抽象的处理模型以应对ViT量化中的不同数据分布. 因此,DAQ不仅对除LN和GELU之外的层的激活量化仍然适用,也支持ViT中的呈典型高斯分布的权重量化,无疑增加了DAQ方法的统一性和普适性.
4. DAQ:基于z-score的自适应分治量化方法
DAQ采用了分治-量化的思路,以z-score方法作为分治基准,结合硬件友好的均匀关联量化算法. 图6描述了DAQ量化流程. 值得强调的是,虽然DAQ从量化能效最大化的角度出发选择LN和GELU输出激活作为关键量化对象,但DAQ方法对其他计算模块(如图6左侧ViT模块中部分填充的层)同样适用.
DAQ经由z-score将输入数据划分为正常值集X_1 与异常值集X_2 :
X_2=\left\{{x}_{i}\,\middle|\,|Z\left({x}_{i}\right)| > \tau \right\},\;\;\tau \in {\mathbb{R}}^+. (10)

ViT激活中隐含的跨分布数据集中性,使z-score方法突破了传统方法对先验分布假设的依赖,实现了跨分布的离群值检测. 尽管z-score方法简洁易行、计算高效(时间复杂度O(n)),但其在实际应用中存在3个问题:
1)统计量估计的冗余存储/计算开销. z-score方法中隐式使用了数据的均值 \mu 和标准差 \sigma . 求取数据全局标准差涉及到数据的二次访问和串行计算开销,加重了资源受限的设备的存储和访存压力.
2)单一静态全局阈值不匹配高动态变化激活值域. 当ViTs中的激活张量呈现出显著非高斯特性时,特别是长尾分布场景下,单一静态全局阈值难以适应其特征空间的动态统计特性.
3)数据规整性破坏引起的存储效率降低. 对正常/离群值集采用混合精度量化可能导致特征张量在内存中的非对齐存储,降低访存效率.
针对问题1),4.1节提出了利用数据极值估计标准差的方法,离线求解估测系数 \alpha ,简化标准差的计算过程,减少数据二次访问. 针对问题2),4.2节提出动态分布阈值 \tau 感知算法:通过离线校正阶段的代表性样本分布特征,对具有不同分布特征的数据动态自适应调节阈值 \tau . 同时,问题1)和问题2)中的参数可以通过联合优化进一步提升算法性能和效率. 4.3节提出的均匀关联量化用于解决问题3). 该量化方法保持数据存储的地址对齐特性,实现数据的高效存取.
为了便于后续行文表达,表2对5个常用的重要变量进行说明.
表 2 变量含义
Table 2. Meanings of Variables

变量名称	含义
\boldsymbol{X}	目标特征张量
{\boldsymbol{X}}_{1}	目标特征张量中的正常值集
{\boldsymbol{X}}_{2}	目标特征张量中的离群值集
\hat{\boldsymbol{X}}	\boldsymbol X 的量化表示
\tilde{\boldsymbol{X}}	\boldsymbol X 的反量化结果

4.1 标准差估计方法
为了减轻计算统计量均值 \mu 和标准差 \sigma 带来的计算/访存开销,同时避免引入高昂的除法代价[26],DAQ提出利用极值估计标准差的自适应方法. 该方法的有效性来源于:z-score方法允许标准差存在可接受的精度误差(如 {10}^{-2}\sim{10}^{-3} )而不影响筛选结果.
通过均值 \mu 、P个极值和估测系数 \alpha 可以近似计算标准差 \sigma :
\sigma \approx \sqrt{\left(\displaystyle\sum_{i=1}^{P}{\left({M}_{i}-\mu \right)}^{2}+\alpha L\right)/L}. (11)

下面对式(11)的理论有效性进行推导. 为了便于说明,设待量化张量 \boldsymbol{X}={\boldsymbol{X}}_{1}\cup {\boldsymbol{X}}_{2} . 如3.1节中观察到的数值集中分布特性,正常值 {\boldsymbol{X}}_{1} 占比高达97%以上,离群值 {\boldsymbol{X}}_{2} 仅占少量.
待量化X的总体标准差 \sigma 满足:
L{\sigma }^{2}=\displaystyle\sum_{i=1}^{L}{\left({x}_{i}-\mu \right)}^{2}=\displaystyle\sum_{x\in {\boldsymbol{X}}_{1}}{\left(x-\mu \right)}^{2}+\displaystyle\sum_{x\in {\boldsymbol{X}}_{2}}{\left(x-\mu \right)}^{2}\text{,} (12)

其中L为待量化张量X的数据总量.
当 x\in {\boldsymbol X}_{2} 时,不妨设 x={x}_{1}+{x}_{2},{x}_{2}\in {\boldsymbol X}_{2},{x}_{1}\in {\boldsymbol{X}}_{1}\cup \{0,\mu \} ,且将 {\boldsymbol{X}}_{2} 视作服从正态分布,则式(12)可以表示为:
\begin{split} L{\sigma }^{2}=& \displaystyle\sum_{x\in {\boldsymbol{X}}_{1}}{\left(x-\mu\right)}^{2}+\displaystyle\sum_{{x}_{1}\in {\boldsymbol{X}}_{2}}{\left(\left({x}_{1}-\mu\right)+\left({x}_{2}-\mu\right)\right)}^{2} \approx\\ & \displaystyle\sum_{x\in \boldsymbol{X}}{\left(x-\mu\right)}^{2}+\displaystyle\sum_{{x}_{1}\in {\boldsymbol{X}}_{2}}{\left({x}_{1}-\mu\right)}^{2}=L{\hat{\sigma }}^{2}+\displaystyle\sum_{{x}_{1}\in {\boldsymbol{X}}_{2}}{\left({x}_{1}-\mu\right)}^{2}. \end{split} (13)

进一步地,用极值表示关于 {\boldsymbol{X}}_{2} 的求和项. 假设取绝对值最大的P个数据,那么有:
\displaystyle\sum_{{x}_{1}\in {\boldsymbol{X}}_{2}}{\left({x}_{1}-\mu\right)}^{2}=\displaystyle\sum_{i=1}^{P}{\left({topk}\left({\boldsymbol{X}}_{2}\right)\left[i\right]-\mu\right)}^{2}+\delta . (14)

其中 \delta 表示用P个极值近似后与原式之间的差值. 将 \delta 记为 \delta=\gamma L ,那么可以得到最后的计算结果:
L{\sigma }^{2}=\left({\hat{\sigma }}^{2}+\gamma \right)L+\displaystyle\sum_{i=1}^{P}{\left({topk}\left({\boldsymbol{X}}_{2}\right)\left[i\right]-\mu\right)}^{2}. (15)

至此,我们得到了用P个极值近似估计标准差的方法. 将 \alpha = {\hat{\sigma }}^{2}+\gamma 视作变量,通过自适应算法求解 \alpha ,就能得到不同特征张量的标准差估计. 以最小化估计标准差与实际标准差之间的平方和作为优化目标,自适应求解估测系数 \alpha ,具体过程如算法1所示.
算法1. 自适应求解估测系数 \alpha .
输入:校正集C,估测系数 \alpha ,极值个数P,校正集数量c;
输出:估测系数 \alpha .
① 初始化 \alpha =1 ;
② {\rm for}\;i\;{\rm in}\;\left[0:c:1\right]\;{\rm do}
③ 计算 {C}\left[i\right] 的均值 \mu 和标准差 \sigma ;
④ \boldsymbol{P}=topk\left(abs\left({C}\right[i\left]\right),P\right) , \boldsymbol{P}\in \mathbb{R}^{P} ;
⑤ L=length\left({C}\right[i\left]\right) ;
⑥ \sigma'=\sqrt{\left({\textstyle\sum}{\left(\boldsymbol{P}-\mu \right)}^{2}+\alpha L\right)/L} ;
⑦ {\alpha '}=\mathop{\rm arg\,min}\limits_{\alpha }{\|\sigma -\sigma'\|}_{2}^{2} ;
⑧ \alpha =(\alpha \times i+\alpha')/(i+1) ;
⑨ end for
⑩ return \alpha .
标准差估计算法减少了平方运算,但其更重要的意义在于避免了二次逐元素遍历数据,极值、均值求取可以同时完成. 在测试过程中发现,离线求解 \alpha 仅需要极少量数据即可完成(4~12个样本).
另一方面,算法中的超参数——极值个数P的取值并不是一个决定性因素. P的取值变化所引起的计算误差会自适应补偿到估测系数 \alpha 中. 在实际应用中,我们实验性地采用了 P=8,16,32 等较小数值,均能达到估计标准差与实际标准差的误差在 {10}^{-3} 之内(与校正集数据比较).
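算法1的求解过程可以写成如下Python示意. 其中第⑦步的 arg min 对单个校正样本存在闭式解(令 σ'(α)=σ 反解出 α),这里直接采用该闭式解并按第⑧步做滑动平均;校正数据为随机生成,仅用于演示:

```python
import numpy as np

def estimate_sigma(x, alpha, P):
    # 式(11): 用 P 个绝对值最大的极值 M_i、均值 μ 与估测系数 α 近似标准差 σ
    mu, L = x.mean(), x.size
    M = x[np.argsort(np.abs(x))[-P:]]                    # topk 极值
    return np.sqrt((np.sum((M - mu) ** 2) + alpha * L) / L)

def calibrate_alpha(calib_set, P=16):
    alpha = 1.0                                          # ① 初始化 α = 1
    for i, c in enumerate(calib_set):                    # ② 遍历校正集
        mu, sigma, L = c.mean(), c.std(), c.size         # ③ 真实均值与标准差
        M = c[np.argsort(np.abs(c))[-P:]]                # ④ 极值列表
        alpha_i = sigma ** 2 - np.sum((M - mu) ** 2) / L # ⑥⑦ 闭式最优 α
        alpha = (alpha * i + alpha_i) / (i + 1)          # ⑧ 滑动平均
    return alpha

rng = np.random.default_rng(0)
calib = [0.5 * rng.standard_normal(197 * 384) for _ in range(8)]
alpha = calibrate_alpha(calib)
x = 0.5 * rng.standard_normal(197 * 384)
print(abs(estimate_sigma(x, alpha, 16) - x.std()))  # 同分布数据上误差通常在 1e-3 量级
```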
4.2 动态分布阈值 \tau 的感知算法
传统z-score方法采用全局静态阈值策略,假设所有张量数据服从单一分布特性. 这一假设违背了ViT权重以及激活值的分布异质性. 如图3所示,ViT模型中不同层的激活值呈现显著分布差异. 基于此,DAQ提出动态分布阈值 \tau 感知算法以推广z-score方法到所有权重和激活中.
阈值 \tau 优化目标为使反量化后的张量 \tilde{\boldsymbol{X}} 与原始张量X之间误差最小,如式(16)所示.
\mathop{\rm arg\,min}\limits_{\tau }{\|\boldsymbol{X}-\tilde{\boldsymbol{X}}\|}_{2}^{2}. (16)

算法2. 阈值 {\tau } 自适应算法.
输入:校正集C,估测系数 \alpha ,极值个数P,校正集数量c;
输出:阈值 \tau .
① {\rm for}\;i\;{\rm in}\;\left[0:c:1\right]\;{\rm do}
② 算法1计算均值 \mu 、标准差 \sigma 和极值列表 p ;
③ {\boldsymbol{X}}_{2}=\left\{x\in C\left[i\right] \mid x > \left(\mu+\tau \sigma \right) 或 x < \left(\mu-\tau \sigma \right)\right\} ;
④ {\boldsymbol{X}}_{1}={C}\left[i\right]\setminus \boldsymbol{X}_2 ;
⑤ \tilde{\boldsymbol{X}}_{1}=Dequant\left(NormQuant\right({\boldsymbol{X}}_{1}\left)\right) ;
⑥ \tilde{\boldsymbol{X}}_{2}=Dequant\left(OutlierQuant\right({\boldsymbol{X}}_{2}\left)\right) ;
⑦ {\tau'}=\mathop{\rm arg\,min}\limits_{\tau }{\left\|\left(\tilde{\boldsymbol{X}}_{1}+\tilde{\boldsymbol{X}}_{2}\right)-\left({\boldsymbol{X}}_{1}+{\boldsymbol{X}}_{2}\right)\right\|}_{2}^{2} ;
⑧ \tau =(\tau \times i+\tau')/(i+1) ;
⑨ end for
⑩ return \tau .
动态分布阈值 \tau 感知算法有2方面的核心优势:1)算法普适性. 阈值 \tau 将正常值与离群值区域分割,两者既可独立量化也可联合量化,适配任意量化方法;2)参数轻量化. 仅需离线求解阈值 \tau ,测试中通过滑动平均法(4~12个样本)快速收敛,避免复杂优化计算.
另外,算法1和算法2可以联合优化,同时求解估测系数 \alpha 和阈值 \tau . 自此,z-score计算所涉及到的参数都已求解完成.
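算法2的核心逻辑可用如下Python示意. 其中量化器用最简的Min-Max均匀量化代替图中的NormQuant/OutlierQuant,arg min 用网格搜索实现(与6.6节所述一致);为简化,正负离群值在此合并量化,而DAQ实际为正/负离群值分别设置量化参数:

```python
import numpy as np

def quant_dequant(x, b=4):
    # 对一个数据域做 Min-Max 均匀量化再反量化(占位实现,仅供示意)
    if x.size == 0:
        return x
    s = (x.max() - x.min()) / (2 ** b - 1) + 1e-12
    q = np.clip(np.round((x - x.min()) / s), 0, 2 ** b - 1)
    return q * s + x.min()

def recon_error(x, tau, b=4):
    mu, sigma = x.mean(), x.std()
    outlier = (x > mu + tau * sigma) | (x < mu - tau * sigma)  # ③ 分域
    x_rec = np.empty_like(x)
    x_rec[~outlier] = quant_dequant(x[~outlier], b)            # ⑤ 正常值域
    x_rec[outlier] = quant_dequant(x[outlier], b)              # ⑥ 离群值域
    return np.sum((x - x_rec) ** 2)                            # 式(16)的目标函数

def calibrate_tau(calib_set, grid=np.arange(2.0, 6.01, 0.25), b=4):
    tau = 0.0
    for i, c in enumerate(calib_set):
        best = min(grid, key=lambda t: recon_error(c, t, b))   # ⑦ 网格搜索 arg min
        tau = (tau * i + best) / (i + 1)                       # ⑧ 滑动平均
    return tau
```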
z-score分域的结果是与输入张量同维度的离群值位图. 为了简化计算,筛选离群值时,将每个元素的z-score值与阈值 \tau 的比较,转换为与阈值 \tau 所代表的数值上下界的逐元素比较,从而避免了逐元素计算z-score,具体过程如算法3所示.
算法3. z-score方法筛选离群值.
输入:张量X,估测系数 \alpha ,极值个数P,阈值 \tau ;
输出:离群值位图M,极值列表p.
① 算法1计算均值 \mu 、标准差 \sigma 和极值列表 p ;
② up=\left(\mu+\tau \sigma \right) , down=\left(\mu-\tau \sigma \right) ;
③ {\rm for}\;i\;{\rm in}\;\left[0:L:1\right]\;{\rm do}
④ {\rm{if}}\;\boldsymbol{X}\left[i\right] > up\;或\;\boldsymbol{X}\left[i\right] < down\;{\rm then}
⑤ M\left[i\right]= 1 ;
⑥ else
⑦ M\left[i\right]=0 ;
⑧ end if
⑨ end for
⑩ p=p+[up,down];
⑪ return M,p .
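算法3可与4.1节的标准差估计融合在一次遍历中完成. 下面给出一个向量化的Python示意(均值、极值与位图一次求得,离线求得的 α 与 τ 作为输入):

```python
import numpy as np

def screen_outliers(x, alpha, P, tau):
    # ① 一次遍历同时得到均值、极值列表与(估计的)标准差
    mu = x.mean()
    flat = x.ravel()
    extremes = flat[np.argsort(np.abs(flat))[-P:]]             # 极值列表 p
    sigma = np.sqrt((np.sum((extremes - mu) ** 2) + alpha * x.size) / x.size)
    # ② 把 z-score 阈值 τ 换算成数值上下界,避免逐元素计算 z-score
    up, down = mu + tau * sigma, mu - tau * sigma
    bitmap = (x > up) | (x < down)                             # ③~⑨ 离群值位图 M
    return bitmap, np.append(extremes, [up, down])             # ⑩ p = p + [up, down]
```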
4.3 均匀关联量化
基于4.2节的z-score检测离群值后,原始数据被划分为正常值 {\boldsymbol{X}}_{1}=\{x \mid |Z(x)|\le \tau \} 与离群值 {\boldsymbol{X}}_{2}= \{x \mid |Z(x)| > \tau \} . 该过程同时记录全局极值 {x}_{\max} 与 {x}_{\min} ,以及正常值集极值 \mu +\tau \sigma 和 \mu -\tau \sigma ,这正是均匀关联量化所需要的关键边界值:全局极值是正/负离群值集中的最大值或最小值,而正常值集极值亦是正/负离群值集的临界值. 这些边界值为后续量化参数计算提供统计基准.
针对 {\boldsymbol{X}}_{1} 与 {\boldsymbol{X}}_{2} 的数值特性,受到分治思想的启发,DAQ提出的均匀关联量化方法采用双域协同优化策略,其流程包含2个阶段:对z-score分域后的各域独立设置量化参数,再通过参数调整提升计算效率. 具体算法过程如图6所示.
如图7所示,空白框图描述了各域的均匀量化步骤. 根据区间极值列表p和目标量化位宽b,分别计算正常值集、正离群值、负离群值集3个区间的缩放因子s, {s}_{+} , {s}_{-} 以及零点偏移z, {z}_{+} , {z}_{-} 后量化X.
图7中阴影框图则为提升计算效率而设计. 多尺度量化参数阻碍了直接利用量化后数据参与计算. 为保持线性计算的高效性,需要对缩放因子和零点偏移做正常值集和离群值集间的关联调整.
对缩放因子引入归一化缩放系数 {k}_{\pm } ,采用幂次缩放对齐策略求解相应的正/负离群值的归一化因子:
{s}_{\pm }\approx {2}^{\left\lceil {\rm lb}\tfrac{{s}_{\pm }}{s}\right\rfloor }\cdot s={2}^{{k}_{\pm }}\cdot s. (17)
另一方面,零点偏移而引起的乘法交叉项可能减弱低位宽量化数据计算带来的增益. 关于零点偏移如何影响计算将在5.1节中详细阐述. 本节只阐释零点偏移如何产生以及如何利用平移补偿消除零点偏移.
数据平移并不改变正常值集和异常值集的区分和极值区间长度. 记 {\varDelta } ,{{\varDelta }}_{+} , {{\varDelta }}_{-}\in \mathbb{R}分别为正常值集、正离群值集、负离群值集的补偿偏移.
图8中横轴表示待量化X的数值范围,左、右两侧黑色间隔分别表示量化后的负、正离群值,中间浅色刻度表示量化后的正常值. 横轴“0”为真实0值. 零点偏移刻画的是量化值“0”与真实值“0”之间的偏移量. 原始数据除以缩放因子后得到的量化数值分布如图8(a)所示. 量化数值需减去零点偏移以满足量化位宽b的表示范围,如图8(b).
正负离群值集共享量化后的 {2}^{b} 个数值,故两者的量化值恰好以量化后零点为分界点. 如图8(a)所示,正/负离群值集的零点到正常值集零点的偏移是相同且固定的,距离为 {2}^{b-1} ,故有:
{z}_+=z+{2}^{b-1},\;\;{z}_-=z-{2}^{b-1}. (18)

而正常值集的零点偏移z由均值和缩放因子计算得到:

z=\left\lceil \frac{\mu }{s}\right\rfloor. (19)

因此,可以通过z计算所有的补偿偏移. 当 z=0 时,在缩放因子s的尺度下, {\varDelta }=0,{{\varDelta }}_{+}=-{2}^{b-1},{{\varDelta }}_{-}=+{2}^{b-1} ;当 z\ne 0 时,说明数据整体偏离真实值“0”,平移补偿则将数据整体平移回真实值“0”,同样在缩放因子s的尺度下,有 {\varDelta }=z,{{\varDelta }}_{+}=-z-{2}^{b-1},{{\varDelta }}_{-}=-z+{2}^{b-1} . 经过平移补偿 {\varDelta } 后有 \boldsymbol X={\boldsymbol X'}+{\varDelta },{\varDelta }\in \mathbb{R} .
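均匀关联量化的参数调整部分可以概括为如下Python示意:式(17)把离群值域的缩放因子对齐到正常值域缩放因子的2的幂次倍(跨域尺度差可用移位消除),式(18)、式(19)给出3个域的零点偏移与相应的平移补偿(函数名与组织方式为示意性假设):

```python
import numpy as np

def align_scales(s, s_pos, s_neg):
    # 式(17): k± = ⌈lb(s±/s)⌋,令 s± ≈ 2^{k±}·s,尺度差可用移位消除
    k_pos = int(np.round(np.log2(s_pos / s)))
    k_neg = int(np.round(np.log2(s_neg / s)))
    return (2.0 ** k_pos) * s, (2.0 ** k_neg) * s, k_pos, k_neg

def zero_offsets(mu, s, b):
    # 式(19): z = ⌈μ/s⌋;式(18): z± = z ± 2^{b−1}
    z = int(np.round(mu / s))
    z_pos, z_neg = z + 2 ** (b - 1), z - 2 ** (b - 1)
    # 平移补偿(缩放因子 s 的尺度下): Δ = z, Δ+ = −z − 2^{b−1}, Δ− = −z + 2^{b−1}
    comp = (z, -z - 2 ** (b - 1), -z + 2 ** (b - 1))
    return (z, z_pos, z_neg), comp
```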
为了减少对正负离群值的符号表达,均匀关联量化中所有量化值均采用带符号的量化表达. 以DAQ中4 bit量化为例,其符号值域如表3所示.
表 3 4比特对称均匀量化
Table 3. 4-bit Symmetric Uniform Quantization

数值类型	值域	量化参数
正常值	0,\pm 1,\pm 2,\pm 3,\pm 4,\pm 5,\pm 6,\pm 7,-8	s, z
离群值*	0,\pm 1,\pm 2,\pm 3,\pm 4,\pm 5,\pm 6,\pm 7,-8	s', z'
正离群值	0,1,2,3,4,5,6,7,8	s_1, z_1
负离群值	-1,-2,-3,-4,-5,-6,-7,-8	s_2, z_2

注:“*”表示当仅存单侧正/负离群值时,不加以符号区分.

5. 线性计算
本节在第4节的基础上讨论基于现有硬件加速架构,充分利用DAQ量化后的低位宽定点计算来降低计算开销.
5.1 量化乘法
本节讨论2个均匀量化后的数据相乘可能的情形. 其计算结果可以表示为:
val=\left(x+{z}_{x}\right){s}_{x}\cdot \left(y+{z}_{y}\right){s}_{y}. (20)

根据零点偏移z的取值情况,式(20)有3种情形:
1) {z}_{x}=0, {z}_{y}=0
当原始数据的均值\mu 趋于0时,量化乘法运算满足完美缩放条件:
val=\left(xy\right)\left({s}_{x}{s}_{y}\right). (21)

此时量化乘法与浮点乘法结果等效且不需要额外的计算补偿,因此计算效率提升最大.
2) {z}_{x}\ne 0,{z}_{y}=0 或者 {z}_{x}=0,{z}_{y}\ne 0
以 {z}_{x}=z,{z}_{y}=0 为例,有:
val=xy{s}_{x}{s}_{y}+zy{s}_{x}{s}_{y}. (22)

式(22)中z为常量,故增加的偏移部分 zy{s}_{x}{s}_{y} 为可提前计算的固定值. 式(22)中的2个求和项之间不存在数据依赖,量化乘法和常量偏移补偿可并行计算,即“先计算(乘法),后补偿(加法)”.
3) {z}_{x}\ne 0,{z}_{y}\ne 0
双零点偏移若直接用量化数据计算,会因较多交叉项的计算导致整体计算效率降低. 此时将INT4量化值动态扩展到INT8以消除双零点偏移后再计算,即“先补偿,后计算”.
{x'}=x+{z}_{x},\;\;{y'}=y+{z}_{y},\;\;val={x'}{y'}{s}_{x}{s}_{y}. (23)

虽然情形3)为了消除零点偏移产生的交叉项而将数据扩展到INT8表示,但这只是将零点偏移提前补偿进量化数据,并不意味着此时的量化位宽增长了1倍.
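5.1节的3种情形可统一为如下Python示意(xq,yq为量化后的整数张量;情形3)中把INT4数据加上零点偏移后扩展到INT8再做整数乘法,示意中用NumPy整数类型模拟):

```python
import numpy as np

def qmul(xq, yq, sx, sy, zx=0, zy=0):
    # 式(20): val = (x + z_x)s_x · (y + z_y)s_y
    if zx == 0 and zy == 0:
        # 情形1(式(21)): 完美缩放,直接使用低位宽整数乘积
        return (xq * yq) * (sx * sy)
    if zx == 0 or zy == 0:
        # 情形2(式(22)): “先计算,后补偿”,交叉项为可预计算的常量偏移
        cross = zx * yq if zy == 0 else zy * xq
        return (xq * yq + cross) * (sx * sy)
    # 情形3(式(23)): “先补偿,后计算”,消除双零点偏移的交叉项
    x8 = xq.astype(np.int8) + np.int8(zx)
    y8 = yq.astype(np.int8) + np.int8(zy)
    return (x8.astype(np.int32) * y8.astype(np.int32)) * (sx * sy)
```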
5.2 GPU Tensor Core计算加速
在NVIDIA Ampere[27]架构图形处理器(graphics processing unit,GPU)中,第三代Tensor Core实现了对INT4/INT8整型矩阵乘累加(matrix multiply-accumulate,MMA)运算的硬件级支持. 以NVIDIA GeForce RTX 3090(基于GA102 GPU)[27]为例,其包含82个流式多处理器(streaming multiprocessor,SM),每个SM集成4个Tensor Core模块,提供总计328个Tensor Core,支持混合精度计算模式(INT输入,FP16/FP32累加器).
量化后的INT数据有望通过Tensor Core实现硬件支持的INT4/INT8计算加速. 模型权重参数具有高斯近零分布特性,对称均匀量化时零点偏移z 自然满足 z=0 . DAQ对激活值虽采用均匀量化方法,但原始数据分布却不一定基于原点对称. 这就导致计算时需要考虑零点偏移的情形.
若正常值集零点偏移 z=0 ,可参照5.1节中的情形2)并结合式(17)的参数归一化,利用Tensor Core所支持的INT4计算单元直接对量化后数据进行计算,并通过CUDA Core并行计算少量离群值量化中产生的固定偏移补偿( {2}^{b-1}\cdot s ). Tensor Core计算过程为:
val=xy\,{2}^{{k}_{x}}{2}^{{k}_{y}}{s}_{x}{s}_{y},\;\;{k}_{x},{k}_{y}\in \left\{0,{k}_+,{k}_-\right\}. (24)

如果正常值集零点偏移 z\ne 0 ,则属于5.1节中的情形3),此时CUDA Core计算补偿项的开销会超过INT4计算带来的加速收益. 因此应先消除零点偏移,将量化INT4数据扩展为INT8(加上零点偏移),再采用INT8的Tensor Core进行计算.
Tensor Core中INT4/INT8矩阵乘加运算对输入矩阵的分块大小有固定限制. PTX指令形如: mma.sync. aligned.m8n8k16.row.col.s32.s8.s8.s32 . m8n8k16指示矩阵计算中的分块大小,s8/s32指示输入和累加器数据精度,row/col指示输入数据分别以行/列主序存储. Tensor Core支持的INT4和INT8矩阵分块如表4所示. 其中,ViT激活的通道维度与权重维度都能被分块大小所整除(32/64),但激活的Token维度(197)需要填充以适应Tensor Core所支持的分块大小.
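由于ViT激活的Token维度(197)不能被Tensor Core分块大小整除,计算前需按分块M做零填充(填充行的计算结果在输出时丢弃,不影响有效结果). 一个简单的示意如下:

```python
import numpy as np

def pad_for_mma(A, tile_m):
    # 将矩阵的行数(Token 维度)零填充为分块 M 的整数倍
    m, k = A.shape
    pad = (-m) % tile_m
    return np.pad(A, ((0, pad), (0, 0))), pad

A = np.zeros((197, 384), dtype=np.int8)   # ViT-S 的一个典型激活形状
A8, _ = pad_for_mma(A, 16)                # INT8 分块 m16n16k16: 197 → 208
A4, _ = pad_for_mma(A, 8)                 # INT4 分块 m8n8k32:  197 → 200
print(A8.shape, A4.shape)                 # (208, 384) (200, 384)
```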
表 4 Tensor Core计算分块和ViT线性计算维度
Table 4. Computing Patch of Tensor Core and Linear Computational Dimensions of ViT

数据格式	Tensor Core矩阵分块(M,N,K)	ViT线性计算(部分)(M,N,K)
INT8	16,16,16	197,192,384
INT8	32,8,16	197,1536,384
INT8	8,32,16	197,3072,768
INT4	8,8,32	197,2304,768

6. 实 验
6.1 实验设置
本文实验在ViT[2],DeiT[3]和Swin[4]及其变体上开展. 图像分类任务所选用的实验数据集为ImageNet-1K[28]测试集,目标检测任务则使用COCO 2017数据集[29]. 图像分类任务的所有全精度模型来自Timm库,目标检测任务的全精度模型来自MMDetection[30]. 为与之前的研究方法保持一致,DAQ在校正过程中随机选取32个样本作为校正集(COCO数据集上选择1个样本作为校正集),权重采用与文献[14]相同的离线量化方案,Softmax输出激活保持原始精度. 需要说明的是,DAQ可以使用更少的样本(如8个或12个)完成校正. 值得强调的是,虽然DAQ认为量化LN和GELU的输出最有利于量化-计算-存储性能的提升,但这不代表DAQ仅适用于LN和GELU层;事实上,DAQ方法对其他层的激活输出仍具备普适性. 实验过程中先将DAQ应用于LN和GELU以得到模型量化结果,随后给出DAQ对其他层的量化结果.
6.2 ImageNet数据集上的量化模型性能
DAQ与现有研究在图像识别任务上的量化模型精度比较结果见表5. 表5中每项数值表示量化模型在图像识别任务上的Top-1精度,精度越高表示识别率越高、模型性能越好;量化目标列表示各研究工作主要针对的量化对象.
表 5 量化模型在图像分类任务的ImageNet数据集上的Top-1准确率
Table 5. Top-1 Accuracy of the Quantization Models on the ImageNet Dataset of the Image Classification Task

方法	量化目标	位宽	ViT-S/%	ViT-B/%	DeiT-T/%	DeiT-S/%	DeiT-B/%	Swin-S/%	Swin-B/%
全精度模型	-	32/32	81.39	84.54	72.21	79.85	81.80	83.23	85.27
FQ-ViT[10]	S/N	4/4	0.10	0.10	0.10	0.10	0.10	0.10	0.10
PTQ4ViT[11]	S/G	4/4	42.75	30.96	36.96	34.08	64.39	76.09	74.02
APQ-ViT[13]	S	4/4	47.95	41.41	47.94	43.55	67.48	77.15	76.48
RepQ-ViT[14]	S/N	4/4	65.05	68.48	57.43	69.03	75.61	79.45	78.32
RepQuant[15]	S/N	4/4	73.28	77.84	64.44	75.21	78.46	81.52	82.80
DAQ(本文)	G/N	4/4	75.06	80.36	67.15	75.46	78.52	80.43	81.45
DAQ(本文)	S/N	4/4	73.29	82.00	68.81	75.45	80.08	79.95	82.30
FQ-ViT[10]	S/N	6/6	4.26	0.10	58.66	45.51	64.63	66.50	52.09
PTQ4ViT[11]	S/G	6/6	78.63	81.65	69.68	76.28	80.25	82.38	84.01
APQ-ViT[13]	S	6/6	79.10	82.21	70.49	77.76	80.42	82.67	84.18
RepQ-ViT[14]	S/N	6/6	80.43	83.62	70.76	78.90	81.27	82.79	84.57
RepQuant[15]	S/N	6/6	80.51	83.75	70.89	79.06	81.41	82.93	84.86
DAQ(本文)	G/N	6/6	80.81	84.09	71.69	79.56	81.63	82.31	83.98
DAQ(本文)	S/N	6/6	80.86	83.85	71.74	79.61	81.94	82.89	84.34

注:列中的缩写S,G,N分别代表Softmax,GELU,LayerNorm层的输出激活;黑体数值表示最优值.

FQ-ViT在4 bit量化时模型崩溃,这可能是由于低位宽计算过程中引入的误差过大. 相比之下,PTQ4ViT,APQ-ViT的量化模型达到了全精度模型一半左右的精度;RepQ-ViT和RepQuant明显提升了量化模型精度. 但DAQ在大多数模型上的量化效果仍然高于目前最优的RepQ系列方法,单个模型最大提升高达4.37个百分点(DeiT-T).
DAQ主要关注4 bit低位宽量化,但其6 bit量化模型同样保持了高准确率:与其他方法相比,DAQ在超过半数的量化模型上达到最好精度,且各量化模型与全精度模型的平均准确率误差仅有0.4%,在DeiT-B模型上甚至超过全精度模型0.14个百分点.
在不同量化位宽下,DAQ无论是在G/N或S/N上都展现出了相似的高准确率,表明了DAQ方法的普适性.
6.3 COCO数据集上量化模型性能
DAQ和现有SOTA量化算法在目标检测任务上的结果如表6所示. 实验所采用的数据集为COCO 2017,Mask R-CNN分别以Swin-T和Swin-S作为骨干网络. 表6中指标 {AP}^{\rm box} (average precision of bounding box)和 {AP}^{\rm mask} (average precision of mask)分别表示目标检测任务中检测框定位的准确率和实例分割任务中像素级轮廓的匹配度,指标越高表示模型性能越好.
表 6 量化模型在目标检测任务的COCO数据集上的性能
Table 6. Performance of the Quantization Models on the COCO Dataset in the Object Detection Task

方法	量化目标	位宽	Swin-T {AP}^{\rm box}/%	Swin-T {AP}^{\rm mask}/%	Swin-S {AP}^{\rm box}/%	Swin-S {AP}^{\rm mask}/%
全精度模型	-	32/32	46.0	41.6	48.5	43.3
PTQ4ViT[11]	S/G	4/4	6.9	7.0	26.7	26.6
APQ-ViT[13]	S	4/4	23.7	22.6	44.7	40.1
RepQ-ViT[14]	S/N	4/4	36.1	36.0	44.2	40.2
RepQuant[15]	S/N	4/4	37.2	36.8	44.5	40.5
DAQ(本文)	G/N	4/4	45.4	41.3	47.9	43.0
DAQ(本文)	S/N	4/4	45.7	41.4	48.1	43.2
PTQ4ViT[11]	S/G	6/6	5.8	6.8	6.5	6.6
APQ-ViT[13]	S	6/6	45.4	41.2	47.9	42.9
RepQ-ViT[14]	S/N	6/6	45.1	41.2	47.8	43.0
RepQuant[15]	S/N	6/6	45.3	41.3	48.1	43.2
DAQ(本文)	G/N	6/6	46.0	41.6	48.2	43.2
DAQ(本文)	S/N	6/6	46.0	41.7	48.2	43.1

注:骨干网络为Mask R-CNN框架下的Swin-T与Swin-S;列中的缩写S,G,N分别代表Softmax,GELU,LayerNorm层的输出激活;黑体数值表示最优值.

表6结果显示,DAQ的4 bit量化在目标检测任务上全面超越了之前的研究工作,各项指标提升2.5~8.2个百分点,与全精度模型的最大误差为0.6个百分点,在4 bit低位宽量化上实现了近乎无损的量化模型.
以Swin-T作为Mask R-CNN的主干网络进行目标检测时,4 bit量化模型的 A{P}^{\rm box} 提升高达8.2个百分点,6 bit量化模型在实例分割上的性能指标 A{P}^{\rm mask} 甚至超过全精度模型0.1个百分点.
另一方面,DAQ将量化目标设为S/N或G/N时均得到了相当的SOTA性能,2种设置间的性能差异仅在 -0.1\sim +0.3 个百分点之间. 实验结果充分证明了DAQ的普适性和有效性.
6.4 消融实验
为了说明DAQ分治策略和动态感知z-score方法的有效性,在ViT-S,DeiT-S,Swin-S这3个模型上开展消融实验. 实验采用4 bit量化位宽,比较3种方法:Min-Max量化、采用传统静态阈值的DAQ量化(DAQ W/O Adaptive)和完整的DAQ量化. 图10中的柱状图表示每类方法在单一类型层输出激活量化下的模型准确率,点线表示对模型的LN和GELU层输出激活同时量化的准确率.
图10中,无论是单层量化还是联合量化,Min-Max量化的效果都最差,最低准确率仅有6.45%. 这与第3节中的发现相一致:均匀量化在低位宽下难以表征高动态范围的激活离群值,导致模型准确率大幅下降. 当DAQ采取分治策略,对离群值和正常值加以区分后,模型准确率得到显著提升,这一点在ViT-S和DeiT-S模型上体现得尤为充分,几乎所有量化结果都获得了1个数量级的性能提升. DAQ进一步采用动态感知的方法调整阈值后,模型准确率继续提升,逼近量化前的模型性能.
从图10还可以观察到一个有趣的现象:Swin模型的抗量化误差能力比前2类模型更好,这与Swin模型具有更多的参数量和更复杂的模型结构有关.
6.5 推理加速性能分析
ViT的计算开销主要集中在线性计算上. 为了说明DAQ对硬件的适配性,表7展示了线性计算层的平均时延:以ImageNet-1K数据集为输入,在NVIDIA RTX 3090 GPU上随机推理100张图片,统计各线性计算层的平均时延;以量化+FP32计算作为基准,DAQ列表示结合INT8/INT4定点计算后的归一化结果.
表 7 线性计算平均时延归一化
Table 7. Normalization of Average Latency in Linear Calculation

矩阵大小(M/N/K)	量化+FP32	DAQ
197/192/192	1	0.57
197/384/192	1	0.51
197/1536/384	1	0.30
197/2304/768	1	0.23
197/3072/768	1	0.20
197/3072/1024	1	0.19
197/4096/1024	1	0.16

由表7可知,在未结合硬件设计时,在线反量化所引入的开销随线性计算规模的增加而增大. 基于GPU Tensor Core加速后,DAQ方法能获得43%~86%的线性计算性能提升. 同时,从表7还可以观察到,DAQ的性能增益并未随线性计算规模的增大而线性增加:矩阵规模增大约120倍(从197/192/192到197/4096/1024),性能增益却只增长了约3.5倍(归一化时延从0.57降至0.16).

另一方面,表8以DAQ为基准,比较了DAQ与PTQ4ViT,RepQ-ViT的端到端推理性能,即在RTX 3090上对ImageNet-1K的val数据集(共50000张图片)推理的总耗时.

表 8 ImageNet-1K测试集时间开销比较
Table 8. Time Overhead Comparison on the ImageNet-1K Test Set

模型	DAQ	PTQ4ViT[11]	RepQ-ViT[14]
ViT-S	70.49	88.21(+25.1%)	91.19(+29.4%)
ViT-B	167.67	209.43(+24.9%)	219.83(+31.1%)
DeiT-T	32.30	45.81(+41.8%)	44.03(+36.3%)
DeiT-S	65.43	88.25(+34.8%)	88.93(+35.9%)
DeiT-B	167.02	209.48(+25.4%)	214.60(+28.5%)
Swin-T	104.57	130.26(+24.5%)	121.36(+16.1%)
Swin-S	168.70	205.2(+21.6%)	197.03(+16.8%)

注:括号中数值表示当前方法相较于DAQ方法的时间增幅百分比,数值越大,表明推理延迟相对于DAQ越高.

表8显示,与PTQ4ViT相比,DAQ最大性能提升高达41.8%,平均性能提升超过28%;DAQ相比RepQ-ViT的时间开销也降低了16.8%~36.3%,平均推理时延减少27%. 与线性计算结果相一致,当模型越大、线性计算开销占比越高时,DAQ的增益增长也有所放缓.
因此,量化性能受限于2个方面:1)在线量化开销,量化算法设计越精细,增加的非规则计算越多,算法引入的端到端延迟开销越高;2)利用Tensor Core等计算核心对量化算法加速时,矩阵分块形式、数据组织和传输都影响着最终加速效果.
6.6 离线校正效率分析
表9展示了不同方法之间的离线校正开销对比,包括校正样本数和校正时间开销2个方面. 离线校正时间开销基于单卡3090 GPU,校正样本集由ImageNet-1K中随机抽取.
表 9 离线校正效率分析
Table 9. Efficiency Analysis of Off-line Calibration

模型	方法	Top-1/%	校正样本数	GPU时间/min
DeiT-S	FP32	79.85	-	-
DeiT-S	PTQ4ViT[11]	34.08	32	3.2
DeiT-S	RepQ-ViT[14]	69.03	32	1.3
DeiT-S	DAQ(本文)	75.46	32	1.2
DeiT-S	DAQ(本文)	75.12	12	0.5
DeiT-S	DAQ(本文)	75.08	4	0.1
Swin-S	FP32	83.23	-	-
Swin-S	PTQ4ViT[11]	76.09	32	7.7
Swin-S	RepQ-ViT[14]	79.45	32	2.9
Swin-S	DAQ(本文)	80.43	32	2.5
Swin-S	DAQ(本文)	80.21	12	0.9
Swin-S	DAQ(本文)	80.01	4	0.3

注:黑体数值表示最优值.

DAQ在离线校正中采用网格搜索确定超参数,单次查找的计算复杂度由网格搜索的范围决定,因此DAQ离线校正的时间开销与样本数量成正比. 但DAQ离线校正的显著优势在于,仅用小样本数据即可保证离线校正后模型的最终准确率.
从表9可以看出,在同样的32个样本下,DAQ的时间开销最短且能得到相对最佳的模型准确率. 另一方面,当样本量大幅减少到12或4时,DAQ的校正时间近似线性减少,同时仍能保持模型准确率(降幅小于1%). 这也意味着DAQ在少样本条件下适应性仍然良好.
6.7 权重量化的适用性
DAQ的重要特性之一是兼容具有不同分布的数据,因此它不仅适用于动态激活量化,同样也适用于静态权重量化. 但ViT权重为静态数据且高度符合高斯分布,已有许多成熟的权重量化方法(如GPTQ[31],AdaRound[32]等)能取得较好的量化效果,因此权重量化并不是DAQ重点解决的问题. 另一方面,为了与SOTA方法进行公平比较,6.2~6.6节中的权重量化均采用与文献[14]相同的量化方法.
本节为了展示DAQ对静态权重量化的适用性,基于ImageNet-1K数据集设计了对应的量化对比实验. 表10中,DAQ行的权重量化仍采用与RepQ-ViT相同的方法[14],DAQ/W表示模型权重与激活均采用DAQ量化.
表 10 DAQ应用于ViTs权重量化的Top-1准确率(单位:%)
Table 10. Top-1 Accuracy of Applying DAQ to ViTs Weight Quantization (%)

方法	ViT-S	ViT-B	DeiT-S	DeiT-B	Swin-S	Swin-B
FP32	81.39	84.54	79.85	81.80	83.23	85.27
RepQ-ViT[14]	65.05	68.48	69.03	75.61	79.45	78.32
DAQ	73.29	82.00	75.45	80.08	79.95	82.30
DAQ/W	70.32	73.00	75.14	79.80	79.90	79.64

表10显示,DAQ/W同时应用在权重与激活上时,比仅在激活上使用DAQ的模型准确率有所降低,降幅最高达9.00个百分点,最低仅为0.05个百分点. 尽管如此,DAQ/W的结果仍然全面超越采用文献[14]方法的RepQ-ViT.
7. 结 论
本文针对视觉Transformer(ViT)系列模型的低比特激活量化难题,研究了ViT关键模块量化误差与计算/存储开销的错位匹配,观察到激活值不依赖于具体数据分布的集中共性,进而提出分治自适应量化方法DAQ,系统性解决了训练后量化中离群值表征、计算-精度失配及硬件协同适配等核心挑战. 理论分析与实验验证表明,DAQ在绝大多数情况下达到最优水平:ImageNet任务中4 bit量化的Top-1精度较现有SOTA方法最大提升4.37个百分点,特定条件下量化模型精度超过全精度模型,同时在目标检测任务上达到近似无损的低位宽量化. DAQ设计的关联量化方法与Tensor Core适配计算流消除了线性计算中的反量化过程,降低了量化引入的计算开销,能够获得43%~86%的线性计算加速. DAQ不仅揭示了非高斯分布下数值聚集性与量化误差传播的关联机制,更为边缘设备部署高精度、低功耗视觉Transformer提供了理论指导与工程实践方案.
8. 讨 论
本文针对以下3个方面进行了讨论:
1)ViT量化对性能的提升. 仅追求低比特量化并不总能带来性能上的提升. 量化虽然能减少访存和带宽需求,但端侧部署还需要考虑计算实时性、能耗等方面. 同时,现有硬件不能自适应地提供低位宽量化数据的高效计算支持,而复杂精妙的量化方法如混合精度量化对硬件计算带来更高的挑战. 最后,PTQ量化为了保持精度,仍需要量化/反量化的额外计算,增加了计算负载.
2)ViT量化所产生的稀疏性. 原始数据中靠近均值的数值均匀量化后为0,因此量化数据的稀疏度会比原始数据要高. 仅依靠现有的计算架构如GPU无法很好地利用这样的非结构化稀疏. 因此,量化算法想要切实带来计算方面的性能提升,必须要考虑底层硬件架构的适配和设计.
3)ViT量化对新兴视觉模型的借鉴意义. ViT在计算机视觉领域的广泛适用性催生了许多跨模态、跨网络结构融合的新兴视觉模型. 以扩散模型从DDPM[34]转向以Sora[33]为代表的Diffusion Transformer(DiT)架构为例,基于ViT的U-ViT架构已展现出替代传统U-Net的潜力. 当以ViT作为新兴模型的骨干网络时,ViT的二次复杂度、数据非高斯分布、存在离群值等特性是由模型本身决定的,因此这些特性只会在表现强度上有所不同而不会完全消失. 成熟的量化方法在新兴模型中也能发挥相应的作用,但仍要结合ViT结构在新兴模型中的变形、融合和增改等引起的独特特性加以综合分析与考量. 例如,DETR目标检测模型融合了ViT和Decoder结构,只针对ViT量化的方法在DETR上并不总能在保证精度的前提下获得相应的量化模型.
作者贡献声明:吕倩茹、许金伟负责方案整体设计并撰写论文;吕倩茹、许金伟、姜晶菲负责部分算法思路和实验方法;姜晶菲、李东升提出指导意见并修改论文.
-
表 1 典型静态图处理系统
Table 1 Typical Static Graph Processing Systems
平台及类别	典型系统	数据表示	编程模型	核心优化	实现算法

CPU 存内图处理
Ligra[2]	CSR	点中心同步	混合Push和Pull模式间动态切换	BFS,BC,Radii,CC,PR,SSSP
Galois[3]	CSR	点中心异步	拓扑感知任务调度和优先级调度	BFS,CC,PR,SSSP,BC
GraphMat[39]	DCSC	点中心同步	基于SpMV实现图处理	PR,BFS,TC,CF,SSSP
Polymer[117]	CSR	点中心同步	面向NUMA,减少远程内存访问	PR,SpMV,BP,BFS,CC,SSSP
GraphIt[4]	CSR,CSC	点中心同步	图计算领域编程语言,混合优化	BFS,CC,CF,PR-D
此外,还有Cagra[118],Ligra+[119]等.

CPU 核外图处理
GraphChi[5]	EL	点中心异步	并行滑动窗口,优化对硬盘的访问	PR,SpMV,CC,TC,CF
X-Stream[76]	EL	边中心同步	以边为中心计算,以流的方式处理边	CC,SSSP,SpMV,PR,BP,ALS
GridGraph[6]	EL	边中心同步	两级图划分,双滑动窗口,减少I/O	BFS,WCC,SpMV,PR
LUMOS[120]	EL	点中心异步	提出依赖驱动的核外图处理	PR,CoEM,DP,BP,WPR,LP
GraphSD[98]	CSR	点中心异步	状态和依赖感知的更新策略	PR,PR-D,CC,SSSP
此外,还有VENUS[121],CLIP[122],MultiLogVC[123]等.

CPU 分布式图处理
Pregel[1]	CSR	点中心BSP同步	高效、易用、可扩展性强	PR,SSSP,CC,LP
PowerGraph[72]	CSR	点中心GAS同步	点中心划分实现负载均衡,减少通信	SSSP,ALS,PR
Gemini[7]	CSR+CSC	点中心同步	高效内存管理,通信技术及计算模型	PR,CC,SSSP,BFS,BC
FLASH[8]	CSR	点中心同步	简单、高效的分布式图分析编程模型	BFS,KC,CC,BC,MM,TC等(70多种算法)
此外,还有PowerLyra[124]等.

GPU 存内图处理
Medusa[125]	CSR	点中心同步	使用CSR表示和消息日志减少读开销	BFS,SSSP
Cusha[73]	CSR	点中心ICU同步	提供G-shard和CW机制,合并访存	BFS,SSSP,CC,SSWP,NN,…
Gunrock[9-10]	CSR+SOA	数据中心同步	AFC编程模型和高性能原语,混合调度	BFS,SSSP,BC,PR,CC
SEP-graph[12]	CSR+CSC	点中心混合	Push/Pull,Sync/Async,TD/DD混合处理	PR,SSSP,BFS,BC
SIMD-X[42]	CSR	点中心ACC同步	运行时任务管理	BFS,PR,SSSP,k-Core
GSWITCH[126]	CSR+CSC	点中心同步	机器学习模型自动调优组合图模式	BFS,CC,PR,SSSP,BC
G2[127]	CSR+CSC	点中心同步	图计算领域原语,混合多种优化	PR,CC,BC,SSSP,BFS
此外,还有IrGL[127]等.

GPU 核外图处理
Totem[128-129]	CSR	点中心同步	低度数据GPU处理,高度数据CPU处理	BFS,PR,BC,SSSP
Graphie[130]	EL	边中心异步	异步边流传输减少开销,重命名技术	BFS,SSSP,CC
Subway[13]	CSR	点中心异步	子图生成,减少数据传输(CPU-GPU)	SSSP,SSWP,BFS,CC,PR,…
EMOGI[103]	CSR	点中心同步	基于零拷贝,优化图算法提高数据传输	SSSP,BFS,CC,PR
HyTGraph[14]	CSR	点中心异步	混合数据转换管理,代价感知异步调度	PR,SSSP,CC,BFS
此外,还有GTS[131],GraphReduce[91],Garaph[132],HALO[99],Grus[133]等.

GPU 分布式图处理
LUX[134]	CSR	点中心同步	数据重划分以实现GPU间负载均衡	PR,CC,SSSP,BC,CF
Gluon[135]	CSR	点中心同步	混合划分优化通信	BFS,CC,PR,SSSP
Groute[15-16]	CSR	点中心异步	借鉴路由思想通信,实现异步图处理	BFS,SSSP,PR,CC
GUM[17]	CSR	点中心同步	负载迁移的高效多GPU图分析系统	BFS,SSSP,PR,WCC
此外,还有MultiGraph[41],SWARMGRAPH[136]等.

表 2 典型静态图处理系统性能
Table 2 Performance of Classic Static Graph Processing Systems
系统	平台	数据集	BFS性能/GTEPS	SSSP性能/GTEPS
Ligra	40核Intel CPU E7-8870	Twitter[173]	4.579	0.553
		rMat27[36]	5.012	0.526
GraphSD	2×8核Intel CPU E5-2620	Uk-union[174]	0.010(PageRank)	0.001
		rMat30[36]	0.009(PageRank)	0.001
FLASH	20节点,单节点30核Intel CPU E5-2682 v4	Uk-2002[175]	0.132	0.043(TC)
		soc-orkut[176]	0.334	0.035(TC)
Tigr	8 GB,P4000 GPU	soc-orkut[176]	2.117	0.760
		Pokec[177]	3.010	1.464
HyTGraph	11 GB,GTX 2080Ti GPU	Twitter[174]	2.306	0.938
		UK-2007[174]	3.761	1.191
GUM	32 GB V100×8 NvLink	Web2001[178-179]	25.147	2.587(PageRank)
		road-USA[175]	0.082	0.784(PageRank)

注:TC(triangle counting).
-
[1] Malewicz G, Austern M H, Bik A J C, et al. Pregel: A system for large-scale graph processing[C]//Proc of the 2010 ACM SIGMOD Int Conf on Management of data. New York: ACM, 2010: 135–146
[2] Shun Julian, Blelloch G E. Ligra: A lightweight graph processing framework for shared memory[C]//Proc of the 18th ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2013: 135–146
[3] Nguyen D, Lenharth A, Pingali K. A lightweight infrastructure for graph analytics[C]//Proc of the 24th ACM Symp on Operating Systems Principles. New York: ACM, 2013: 456–471
[4] Zhang Yunming, Yang Mengjiao, Baghdadi R, et al. GraphIt: A high-performance graph DSL[J]. Proceedings of the ACM on Programming Languages, 2018, 2: 1−30
[5] Kyrola A, Blelloch G, Guestrin C. GraphChi: Large-scale graph computation on just a PC[C]//Proc of the 10th USENIX Conf on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2012: 31–46
[6] Zhu Xiaowei, Han Wentao, Chen Wenguang. GridGraph: Large-scale graph processing on a single machine using 2-level hierarchical partitioning[C]//Proc of the 2015 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2015: 375–386
[7] Zhu Xiaowei, Chen Wenguang, Zheng Weimin, et al. Gemini: A computation-centric distributed graph processing system[C]//Proc of the 12th USENIX Conf on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2016: 301–316
[8] Li Xue, Meng Ke, Qin Lu, et al. FLASH: A framework for programming distributed graph processing algorithms[C]//Proc of the 39th Int Conf on Data Engineering (ICDE). Piscataway, NJ: IEEE, 2023: 232−244
[9] Wang Yangzihao, Davidson A, Pan Yuechao, et al. Gunrock: A high-performance graph processing library on the GPU[C]//Proc of the 21st ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2016: 1−12
[10] Wang Yangzihao, Pan Yuechao, Davidson A, et al. Gunrock: GPU graph analytics[J]. ACM Transactions on Parallel Computing, 2017, 4(1): 1−49
[11] Sabet A H N, Qiu Junqiao, Zhao Zhijia. Tigr: Transforming irregular graphs for GPU-friendly graph processing[C]//Proc of the 23rd Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2018: 622–636
[12] Wang Hao, Geng Liang, Lee R, et al. SEP-graph: Finding shortest execution paths for graph processing under a hybrid framework on GPU[C]//Proc of the 24th Symp on Principles and Practice of Parallel Programming. New York: ACM, 2019: 38–52
[13] Sabet A H N, Zhao Zhijia, Gupta R. Subway: Minimizing data transfer during out-of-GPU-memory graph processing[C]//Proc of the 15th European Conf on Computer Systems. New York: ACM, 2020: 1−16
[14] Wang Qiange, Ai Xin, Zhang Yanfeng, et al. HyTGraph: GPU-accelerated graph processing with hybrid transfer management[C]//Proc of the 39th Int Conf on Data Engineering (ICDE). Piscataway, NJ: IEEE, 2023: 558−571
[15] Ben-Nun T, Sutton M, Pai S, et al. Groute: An asynchronous multi-GPU programming model for irregular computations[C]//Proc of the 22nd ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2017: 235–248
[16] Ben-Nun T, Sutton M, Pai S, et al. Groute: Asynchronous multi-GPU programming model with applications to large-scale graph processing[J]. ACM Transactions on Parallel Computing, 2020, 7(3): 1−27
[17] Meng Ke, Geng Liang, Li Xue, et al. Efficient multi-GPU graph processing with remote work stealing [C]//Proc of the 39th Int Conf on Data Engineering (ICDE). Piscataway, NJ: IEEE, 2023: 191−204
[18] Sengupta D, Song S L. EvoGraph: On-the-fly efficient mining of evolving graphs on GPU[C]//Proc of the High Performance Computing: 32nd Int Conf, ISC High Performance 2017. Berlin: Springer, 2017: 97–119
[19] Mariappan M, Vora K. GraphBolt: Dependency-driven synchronous processing of streaming Graphs[C]//Proc of the 14th EuroSys Conf 2019. New York: ACM, 2019: 1−16
[20] Kumar P, Huang H H. GraphOne: A data store for real-time analytics on evolving graphs[C]//Proc of the 17th USENIX Conf on File and Storage Technologies (FAST’19). Berkeley, CA: USENIX Association, 2019: 249−263
[21] Iyer A P, Pu Qifan, Patel K, et al. TEGRA: Efficient ad-hoc analytics on evolving graphs[C]//Proc of the 18th USENIX Symp on Networked Systems Design and Implementation (NSDI’21). Berkeley, CA: USENIX Association, 2021: 337−355
[22] Martella C, Shaposhnik R, Logothetis D, et al. Practical Graph Analytics with Apache Giraph [M]. Berkeley, CA: Apress, 2015
[23] Gonzalez J E, Xin R S, Dave A, et al. GraphX: Graph processing in a distributed dataflow framework[C]//Proc of the 11th USENIX Symp on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2014: 599−613
[24] Fan Wenfei, Xu Jingbo, Wu Yinghui, et al. GRAPE: Parallelizing sequential graph computations[J]. Proceedings of the VLDB Endowment, 2017, 10(12): 1889−1892 doi: 10.14778/3137765.3137801
[25] Yang Ke, Zhang Mingxing, Chen Kang, et al. KnightKing: a fast distributed graph random walk engine[C]//Proc of the 27th ACM Symp on Operating Systems Principles. New York: ACM, 2019: 524–537
[26] Tencent: Plato code [EB/OL]. [2024−01−20].https://github.com/Tencent/plato/
[27] Fan Wenfei, He Tao, Lai Longbin, et al. GraphScope: a unified engine for big graph processing[J]. Proceedings of the VLDB Endowment, 2021, 14(12): 2879−2892 doi: 10.14778/3476311.3476369
[28] Ant Group: TuGraph website [EB/OL]. [2024−01−20].https://www.tugraph.org/
[29] Shi Xuanhua, Zheng Zhigao, Zhou Yongluan, et al. Graph processing on GPUs: A survey[J]. ACM Computing Surveys, 2018, 50(6): 1−35
[30] Huang Jianqiang, Qin Wei, Wang Xiaoying, et al. Survey of external memory large-scale graph processing on a multi-core system[J]. The Journal of Supercomputing, 2020, 76(1): 549−579 doi: 10.1007/s11227-019-03023-0
[31] 王靖,张路,王鹏宇,等. 面向图计算的内存系统优化技术综述[J]. 中国科学:信息科学,2019,49(3):295−313 doi: 10.1360/N112018-00281 Wang Jing, Zhang Lu, Wang Pengyu, et al. Survey on memory system optimization technologies for graph computing[J]. SCIENTIA SINICA Informationis, 2019, 49(3): 295−313 (in Chinese) doi: 10.1360/N112018-00281
[32] Gui Chuangyi, Zheng Long, He Bingsheng, et al. A survey on graph processing accelerators: Challenges and opportunities[J]. Journal of Computer Science Technology, 2019, 34: 339−371 doi: 10.1007/s11390-019-1914-z
[33] 张宇,姜新宇,余辉,等. 图计算体系结构和系统软件关键技术综述[J]. 计算机研究与发展,2024,61(1):20−42 doi: 10.7544/issn1000-1239.202220778 Zhang Yu, Jiang Xinyu, Yu Hui, et al. Research and trend of graph computing architecture and software system[J]. Journal of Computer Research and Development, 2024, 61(1): 20−42 (in Chinese) doi: 10.7544/issn1000-1239.202220778
[34] Heidari S, Simmhan Y, Calheiros R N, et al. Scalable graph processing frameworks: A taxonomy and open challenges[J]. ACM Computing Surveys, 2018, 51(3): 1−53
[35] Sahu S, Mhedhbi A, Salihoglu S, et al. The ubiquity of large graphs and surprising challenges of graph processing[J]. Proceedings of the VLDB Endowment, 2017, 11(4): 420−431 doi: 10.1145/3186728.3164139
[36] Chakrabarti D, Zhan Yiping, Faloutsos C. R-MAT: A recursive model for graph mining[C]//Proc of the 2004 SIAM Int Conf on Data Mining. Philadelphia, PA: SIAM, 2004: 442−446
[37] Takac L, Zabovsky M. Data analysis in public social networks[C]//Proc of the Int Scientific Conf and Int Workshop Present Day Trends of Innovations. 2012: 1−6
[38] Leskovec J, Lang K, Dasgupta A, et al. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters[J]. Internet Mathematics, 2008, 6(1): 29−123
[39] Sundaram N, Satish N, Patwary M M A, et al. GraphMat: high performance graph analytics made productive[J]. Proceedings of the VLDB Endowment, 2015, 8(11): 1214−1225 doi: 10.14778/2809974.2809983
[40] Yang C, Buluç A, Owens J D. GraphBLAST: A high-performance linear algebra-based graph framework on the GPU[J]. ACM Transactions on Mathematical Software, 2022, 48(1): 1−51
[41] Hong Changwan, Sukumaran-Rajam A, Kim J, et al. MultiGraph: Efficient graph processing on GPUs[C]//Proc of the 26th Int Conf on Parallel Architectures and Compilation Techniques (PACT). Piscataway, NJ: IEEE, 2017: 27−40
[42] Liu Hang, Huang H H. SIMD-X: Programming and processing of graph algorithms on GPUs[C]//Proc of the 2019 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2019: 411–427
[43] Green O, Bader D A. cuSTINGER: Supporting dynamic graph algorithms for GPUs[C]//Proc of the 2016 IEEE High Performance Extreme Computing Conf (HPEC). Piscataway, NJ: IEEE, 2016: 1−6
[44] Busato F, Green O, Bombieri N, et al. Hornet: An efficient data structure for dynamic sparse graphs and matrices on GPUs[C]//Proc of the 2018 IEEE High Performance Extreme Computing Conf (HPEC). Piscataway, NJ: IEEE, 2018: 1−7
[45] Winter M, Mlakar D, Zayer R, et al. faimGraph: High performance management of fully-dynamic graphs under tight memory constraints on the GPU[C]//Proc of the SC18: Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2018: 754−766
[46] Sha Mo, Li Yuchen, He Bingsheng, et al. Accelerating dynamic graph analytics on GPUs[J]. Proceedings of the VLDB Endowment, 2017, 11(1): 107−120 doi: 10.14778/3151113.3151122
[47] Bader D A, Berry J, Kahan S, et al. Graph500 [EB/OL]. [2024−01−20]. http://www.graph500.org
[48] Dong Rongyu, Cao Huawei, Ye Xiaochun, et al. Highly efficient and GPU-friendly implementation of BFS on single-node system[C]//Proc of 2020 IEEE Int Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom). Piscataway, NJ: IEEE, 2020: 544−553
[49] Liu Hang, Huang H H. Enterprise: Breadth-first graph traversal on GPUs[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2015: 1−12
[50] Dijkstra E W. A Note on Two Problems in Connexion with Graphs [M]. Edsger Wybe Dijkstra: His Life, Work, and Legacy. 2022: 287−290
[51] Bellman R. On a routing problem[J]. Quarterly of Applied Mathematics, 1958, 16(1): 87−90 doi: 10.1090/qam/102435
[52] Meyer U, Sanders P. δ-stepping: A parallel single source shortest path algorithm[C]//Proc of the European Symp on algorithms. Berlin: Springer, 1998: 393−404
[53] Wang Kai, Fussell D, Lin Calvin. A fast work-efficient SSSP algorithm for GPUs[C]//Proc of the 26th ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2021: 133–146
[54] Dong Xiaojun, Gu Yan, Sun Yihan, et al. Efficient stepping algorithms and implementations for parallel shortest paths[C]//Proc of the 33rd ACM Symp on Parallelism in Algorithms and Architectures. New York: ACM, 2021: 184–197
[55] Dhulipala L, Blelloch G, Shun Julian. Julienne: A framework for parallel graph algorithms using work-efficient bucketing[C]//Proc of the 29th ACM Symp on Parallelism in Algorithms and Architectures. New York: ACM, 2017: 293–304
[56] Zhang Yuan, Cao Huawei, Zhang Jie, et al. A bucket-aware asynchronous single-source shortest path algorithm on GPU[C]//Proc of the 52nd Int Conf on Parallel Processing Workshops. New York: ACM, 2023: 50–60
[57] Beamer S, Asanović K, Patterson D. Reducing PageRank communication via propagation blocking[C]//Proc of the 2017 IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2017: 820−831
[58] Lakhotia K, Kannan R, Prasanna V. Accelerating PageRank using partition-centric processing[C]//Proc of the 2018 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2018: 427–440
[59] Pandey S, Li X S, Buluc A, et al. H-INDEX: Hash-indexing for parallel triangle counting on GPUs[C]//Proc of the 2019 IEEE High Performance Extreme Computing Conf (HPEC). Piscataway, NJ: IEEE, 2019: 1−7
[60] Yaşar A, Rajamanickam S, Wolf M, et al. Fast triangle counting using Cilk[C]//Proc of the 2018 IEEE High Performance extreme Computing Conf (HPEC). Piscataway, NJ: IEEE, 2018: 1−7
[61] Wasserman S, Faust K. Social network analysis: Methods and applications[J]. American Ethnologist, 1997, 24(1): 219−220 doi: 10.1525/ae.1997.24.1.219
[62] Mislove A, Marcon M, Gummadi K P, et al. Measurement and analysis of online social networks[C]//Proc of the 7th ACM SIGCOMM Conf on Internet Measurement. New York: ACM, 2007: 29−42
[63] Ye Chang, Li Yuchen, He Bingsheng, et al. GPU-accelerated graph label propagation for real-time fraud detection[C]//Proc of the 2021 Int Conf on Management of Data. New York: ACM, 2021: 2348–2356
[64] Jiang Jiaxin, Li Yuan, He Bingsheng, et al. Spade: A real-time fraud detection framework on evolving graphs[J]. Proceedings of the VLDB Endowment, 2022, 16(3): 461−469 doi: 10.14778/3570690.3570696
[65] Alam M A, Faruq M O. Finding shortest path for road network using Dijkstra’s algorithm[J]. Bangladesh Journal of Multidisciplinary Scientific Research, 2019, 1(2): 41−45 doi: 10.46281/bjmsr.v1i2.366
[66] Xiao Yunpeng, He Xi, Yang Chen, et al. Dynamic graph computing: A method of finding companion vehicles from traffic streaming data[J]. Information Sciences, 2022, 591: 128−141 doi: 10.1016/j.ins.2022.01.022
[67] Huang Zan, Chung Wingyan, Chen Hsinchun, et al. A graph model for E-commerce recommender systems[J]. Journal of the American Society for Information Science, 2004, 55(3): 259−274 doi: 10.1002/asi.10372
[68] Mcauley J, Pandey R, Leskovec J. Inferring networks of substitutable and complementary products[C]//Proc of the 21st ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2015: 785–794
[69] Yang Xiaoyong, Zhu Yadong, Zhang Yi, et al. Large scale product graph construction for recommendation in E-commerce [J]. arXiv preprint, arXiv: 2010.05525, 2020
[70] Afarin M, Gao Chao, Rahman S, et al. CommonGraph: Graph analytics on evolving data[C]//Proc of the 28th ACM Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2023: 133–145
[71] Zou Lei, Zhang Fan, Lin Yinnian, et al. An efficient data structure for dynamic graph on GPUs[J]. IEEE Transactions on Knowledge and Data Engineering, 2023, 35(11): 11051−11066
[72] Gonzalez J E, Low Yucheng, Gu Haijie, et al. PowerGraph: Distributed graph-parallel computation on natural graphs[C]//Proc of the 10th USENIX Conf on Operating Systems Design and Implementation. Berkeley, CA: USENIX Association, 2012: 17–30
[73] Khorasani F, Vora K, Gupta R, et al. CuSha: Vertex-centric graph processing on GPUs[C]//Proc of the 23rd Int Symp on High-Performance Parallel and Distributed Computing. New York: ACM, 2014: 239–252
[74] Zhang Yu, Liao Xiaofei, Jin Hai, et al. DiGraph: An efficient path-based iterative directed graph processing system on multiple GPUs[C]//Proc of the 24th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2019: 601–614
[75] Zhang Yu, Peng Da, Liao Xiaofei, et al. LargeGraph: An efficient dependency-aware GPU-accelerated large-scale graph processing[J]. ACM Transactions on Architecture and Code Optimization. New York: ACM, 2021, 18(4): 1−24
[76] Roy A, Mihailovic I, Zwaenepoel W. X-Stream: Edge-centric graph processing using streaming partitions[C]//Proc of the 24th ACM Symp on Operating Systems Principles. New York: ACM, 2013: 472–488
[77] Lin Heng, Zhu Xiaowei, Yu Bowen, et al. ShenTu: Processing multi-trillion edge graphs on millions of cores in seconds[C]//Proc of the SC18: Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2018: 706−716
[78] Tian Yanyan, Balmin A, Corsten S A, et al. From "think like a vertex" to "think like a graph"[J]. Proceedings of the VLDB Endowment, 2013, 7(3): 193−204 doi: 10.14778/2732232.2732238
[79] Kang U, Tsourakakis C E, Faloutsos C. PEGASUS: A peta-scale graph mining system implementation and observations[C]//Proc of the 9th IEEE Int Conf on Data Mining. Piscataway, NJ: IEEE, 2009: 229−238
[80] Vora K, Gupta R, Xu Guoqing. KickStarter: Fast and accurate computations on streaming graphs via trimmed approximations[C]//Proc of the 22nd Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2017: 237–251
[81] Jiang Xiaolin, Xu Chengshuo, Yin Xizhe, et al. Tripoline: Generalized incremental graph processing via graph triangle inequality[C]//Proc of the 16th European Conf on Computer Systems. New York: ACM, 2021: 17–32
[82] Islam A a R, Dai Dong. DGAP: Efficient dynamic graph analysis on persistent memory[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2023: 1−13
[83] Sengupta D, Sundaram N, Zhu Xia, et al. GraphIn: An online high performance incremental graph processing framework[C]//Proc of the 22nd Int European Conf on Parallel and Distributed Computing (Euro-Par 2016). Berlin: Springer, 2016: 319−333
[84] Zhu Huanzhou, He Ligang, Leeke M, et al. WolfGraph: The edge-centric graph processing on GPU[J]. Future Generation Computer Systems, 2020, 111: 552−569 doi: 10.1016/j.future.2019.09.052
[85] Sheng Feng, Cao Qiang, Yao Jie. Exploiting buffered updates for fast streaming graph analysis[J]. IEEE Transactions on Computers, 2021, 70(2): 255−269 doi: 10.1109/TC.2020.2987571
[86] Feng Guanyu, Ma Zixuan, Li Daixuan, et al. RisGraph: A real-time streaming system for evolving graphs to support sub-millisecond per-update analysis at millions Ops/s[C]//Proc of the 2021 Int Conf on Management of Data. New York: ACM, 2021: 513–527
[87] Cheng R, Hong Ji, Kyrola A, et al. Kineograph: Taking the pulse of a fast-changing and connected world[C]//Proc of the 7th ACM European Conf on Computer Systems. New York: ACM, 2012: 85–98
[88] Ju Wuyang, Li Jianxin, Yu Weiren, et al. iGraph: An incremental data processing system for dynamic graph[J]. Frontiers of Computer Science, 2016, 10: 462−476 doi: 10.1007/s11704-016-5485-7
[89] Jaiyeoba W, Skadron K. Graphtinker: A high performance data structure for dynamic graph processing[C]//Proc of the 2019 IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2019: 1030−1041
[90] Zhang Yuan, Cao Huawei, Liang Yan, et al. FSGraph: Fast and scalable implementation of graph traversal on GPUs[J]. CCF Transactions on High Performance Computing, 2023, 5(3): 277−291 doi: 10.1007/s42514-023-00155-x
[91] Sengupta D, Song S L, Agarwal K, et al. GraphReduce: Processing large-scale graphs on accelerator-based systems[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis. New York: ACM, 2015: 1−12
[92] Xue Rui, Han Haoyu, Zhao Tong, et al. Large-scale graph neural networks: The past and new frontiers[C]//Proc of the 29th ACM SIGKDD Conf on Knowledge Discovery and Data Mining. New York: ACM, 2023: 5835–5836
[93] Ma Lingxiao, Yang Zhi, Miao Youshan, et al. Neugraph: Parallel deep neural network computation on large graphs[C]//Proc of the 2019 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2019: 443–457
[94] Mariappan M, Che J, Vora K. DZiG: Sparsity-aware incremental processing of streaming graphs[C]//Proc of the 16th European Conf on Computer Systems. New York: ACM, 2021: 83–98
[95] Gong Shufeng, Tian Chao, Yin Qiang, et al. Automating incremental graph processing with flexible memoization[J]. Proceedings of the VLDB Endowment, 2021, 14(9): 1613−1625 doi: 10.14778/3461535.3461550
[96] Yu Song, Gong Shufeng, Zhang Yanfeng, et al. Layph: Making change propagation constraint in incremental graph processing by layering graph [J]. arXiv preprint, arXiv: 2304.07458, 2023
[97] Jiang Zihan, Mao Fubing, Guo Yapu, et al. ACGraph: Accelerating streaming graph processing via dependence hierarchy[C]//Proc of the 2023 60th ACM/IEEE Design Automation Conf (DAC). Piscataway, NJ: IEEE, 2023: 1−6
[98] Xu Xianghao, Jiang Hong, Wang Fang, et al. GraphSD: A state and dependency aware out-of-core graph processing system[C]//Proc of the 51st Int Conf on Parallel Processing. New York: ACM, 2023: 1−11
[99] Gera P, Kim H, Sao P, et al. Traversing large graphs on GPUs with unified memory[J]. Proceedings of the VLDB Endowment, 2020, 13(7): 1119−1133 doi: 10.14778/3384345.3384358
[100] Dhulipala L, Blelloch G E, Shun Julian. Low-latency graph streaming using compressed purely-functional trees[C]//Proc of the 40th ACM SIGPLAN Conf on Programming Language Design and Implementation. New York: ACM, 2019: 918–934
[101] Zhu Xiaowei, Feng Guanyu, Serafini M, et al. LiveGraph: A transactional graph storage system with purely sequential adjacency list scans[J]. Proceedings of the VLDB Endowment, 2020, 13(7): 1020−1034 doi: 10.14778/3384345.3384351
[102] Kim J, Dally W J, Scott S, et al. Technology-driven, highly-scalable dragonfly topology[C]//Proc of 2008 Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2008: 77−88
[103] Min S W, Mailthody V S, Qureshi Z, et al. EMOGI: Efficient memory-access for out-of-memory graph-traversal in GPUs[J]. Proceedings of the VLDB Endowment, 2020, 14(2): 114−127 doi: 10.14778/3425879.3425883
[104] Zhou Shijie, Kannan R, Prasanna V K, et al. HitGraph: High-throughput graph processing framework on FPGA[J]. IEEE Transactions on Parallel and Distributed Systems, 2019, 30(10): 2249−2264 doi: 10.1109/TPDS.2019.2910068
[105] Chen Xinyu, Tan Hongshi, Chen Yao, et al. ThunderGP: HLS-based graph processing framework on FPGAs[C]//Proc of 2021 ACM/SIGDA Int Symp on Field-Programmable Gate Arrays. New York: ACM, 2021: 69–80
[106] Dai Guohao, Huang Tianhao, Wang Yu, et al. NewGraph: Balanced large-scale graph processing on FPGAs with low preprocessing overheads[C]//Proc of 2018 IEEE 26th Annual Int Symp on Field-Programmable Custom Computing Machines (FCCM). Piscataway, NJ: IEEE, 2018: 208−208
[107] Dai Guohao, Huang Tianhao, Chi Yuze, et al. ForeGraph: Exploring large-scale graph processing on multi-FPGA architecture[C]//Proc of 2017 ACM/SIGDA Int Symp on Field-Programmable Gate Arrays. New York: ACM, 2017: 217–226
[108] Chen Xinyu, Chen Yao, Cheng Feng, et al. ReGraph: Scaling graph processing on HBM-enabled FPGAs with heterogeneous pipelines[C]//Proc of the 55th IEEE/ACM Int Symp on Microarchitecture (MICRO). Piscataway, NJ: IEEE, 2022: 1342−1358
[109] Wang Qinggang, Zheng Long, Huang Yu, et al. GraSU: A fast graph update library for FPGA-based dynamic graph processing[C]//Proc of 2021 ACM/SIGDA Int Symp on Field-Programmable Gate Arrays. New York: ACM, 2021: 149–159
[110] Ham T J, Wu Lisa, Sundaram N, et al. Graphicionado: A high-performance and energy-efficient accelerator for graph analytics[C]//Proc of the 49th Annual IEEE/ACM Int Symp on Microarchitecture (MICRO). Piscataway, NJ: IEEE, 2016: 1−13
[111] Yan Mingyu, Hu Xing, Li Shuangchen, et al. Alleviating irregularity in graph analytics acceleration: A hardware/software co-design approach[C]//Proc of the 52nd Annual IEEE/ACM Int Symp on Microarchitecture. New York: ACM, 2019: 615–628
[112] Dadu V, Liu Sihao, Nowatzki T. PolyGraph: Exposing the value of flexibility for graph processing accelerators[C]//Proc of the 48th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2021: 595–608
[113] Yang Yifan, Li Zhaoshi, Deng Yangdong, et al. GraphABCD: Scaling out graph analytics with asynchronous block coordinate descent[C]//Proc of the ACM/IEEE 47th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2020: 419–432
[114] Rahman S, Abu-Ghazaleh N, Gupta R. GraphPulse: An event-driven hardware accelerator for asynchronous graph processing[C]//Proc of the 53rd Annual IEEE/ACM Int Symp on Microarchitecture (MICRO). Piscataway, NJ: IEEE, 2020: 908−921
[115] Rahman S, Afarin M, Abu-Ghazaleh N, et al. JetStream: Graph analytics on streaming data with event-driven hardware accelerator[C]//Proc of the 54th Annual IEEE/ACM Int Symp on Microarchitecture. New York: ACM, 2021: 1091–1105
[116] Zhang Yu, Liao Xiaofei, Jin Hai, et al. DepGraph: A dependency-driven accelerator for efficient iterative graph processing[C]//Proc of the 2021 IEEE Int Symp on High-Performance Computer Architecture (HPCA). Piscataway, NJ: IEEE, 2021: 371−384
[117] Zhang Kaiyuan, Chen Rong, Chen Haibo. NUMA-aware graph-structured analytics[C]//Proc of the 20th ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2015: 183–193
[118] Zhang Yunming, Kiriansky V, Mendis C, et al. Making caches work for graph analytics[C]//Proc of the 2017 IEEE Int Conf on Big Data (Big Data). Piscataway, NJ: IEEE, 2017: 293−302
[119] Shun Julian, Dhulipala L, Blelloch G E. Smaller and faster: Parallel processing of compressed graphs with Ligra+[C]//Proc of the 2015 Data Compression Conf. Piscataway, NJ: IEEE, 2015: 403−412
[120] Vora K. LUMOS: Dependency-driven disk-based graph processing[C]//Proc of the 2019 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2019: 429–442
[121] Cheng Jiefeng, Liu Qin, Li Zhenguo, et al. VENUS: Vertex-centric streamlined graph computation on a single PC[C]//Proc of the 2015 IEEE 31st Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2015: 1131−1142
[122] Ai Zhiyuan, Zhang Mingxing, Wu Yongwei, et al. Squeezing out all the value of loaded data: An out-of-core graph processing system with reduced disk I/O[C]//Proc of the 2017 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2017: 125–137
[123] Matam K K, Hashemi H, Annavaram M. MultiLogVC: Efficient out-of-core graph processing framework for flash storage[C]//Proc of the 2021 IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2021: 245−255
[124] Chen Rong, Shi Jiaxin, Chen Yanzhe, et al. PowerLyra: Differentiated graph computation and partitioning on skewed graphs[J]. ACM Transactions on Parallel Computing, 2019, 5(3): 1−39
[125] Zhong Jianlong, He Bingsheng. Medusa: Simplified graph processing on GPUs[J]. IEEE Transactions on Parallel and Distributed Systems, 2014, 25(6): 1543−1552 doi: 10.1109/TPDS.2013.111
[126] Meng Ke, Li Jiajia, Tan Guangming, et al. A pattern based algorithmic autotuner for graph processing on GPUs[C]//Proc of the 24th Symp on Principles and Practice of Parallel Programming. New York: ACM, 2019: 201–213
[127] Pai S, Pingali K. A compiler for throughput optimization of graph algorithms on GPUs[C]//Proc of the 2016 ACM SIGPLAN Int Conf on Object-Oriented Programming, Systems, Languages, and Applications. New York: ACM, 2016: 1–19
[128] Gharaibeh A, Santos-Neto E, Costa L, et al. Efficient large-scale graph processing on hybrid CPU and GPU systems[J]. arXiv preprint, arXiv: 1312.3018, 2013
[129] Gharaibeh A, Costa L B, Santos-Neto E, et al. A yoke of oxen and a thousand chickens for heavy lifting graph processing[C]//Proc of the 21st Int Conf on Parallel Architectures and Compilation Techniques. New York: ACM, 2012: 345–354
[130] Han Wei, Mawhirter D, Wu Bo, et al. Graphie: Large-scale asynchronous graph traversals on just a GPU[C]//Proc of 2017 26th Int Conf on Parallel Architectures and Compilation Techniques (PACT). Piscataway, NJ: IEEE, 2017: 233−245
[131] Kim M S, An K, Park H, et al. GTS: A fast and scalable graph processing method based on streaming topology to GPUs[C]//Proc of the 2016 Int Conf on Management of Data. New York: ACM, 2016: 447–461
[132] Ma Lingxiao, Yang Zhi, Chen Han, et al. Garaph: Efficient GPU-accelerated graph processing on a single machine with balanced replication[C]//Proc of the 2017 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2017: 195–207
[133] Wang Pengyu, Wang Jing, Li Chao, et al. Grus: Toward unified-memory-efficient high-performance graph processing on GPU[J]. ACM Transactions on Architecture and Code Optimization, 2021, 18(2): 1−25
[134] Jia Zhihao, Kwon Y, Shipman G, et al. A distributed multi-GPU system for fast graph processing[J]. Proceedings of the VLDB Endowment, 2017, 11(3): 297−310 doi: 10.14778/3157794.3157799
[135] Dathathri R, Gill G, Hoang L, et al. Gluon: A communication-optimizing substrate for distributed heterogeneous graph analytics[C]//Proc of the 39th ACM SIGPLAN Conf on Programming Language Design and Implementation. New York: ACM, 2018: 752–768
[136] Ji Yuede, Liu Hang, Huang H H. SWARMGRAPH: Analyzing large-scale in-memory graphs on GPUs[C]//Proc of the 2020 IEEE 22nd Int Conf on High Performance Computing and Communications; IEEE 18th Int Conf on Smart City; IEEE 6th Int Conf on Data Science and Systems (HPCC/SmartCity/DSS). Piscataway, NJ: IEEE, 2020: 52−59
[137] Brahmakshatriya A, Zhang Yunming, Hong Changwan, et al. Compiling graph applications for GPUs with GraphIt[C]//Proc of the 2021 IEEE/ACM Int Symp on Code Generation and Optimization (CGO). Piscataway, NJ: IEEE, 2021: 248−261
[138] Yuan Pingpeng, Xie Changfeng, Liu Ling, et al. PathGraph: A path centric graph processing system[J]. IEEE Transactions on Parallel and Distributed Systems, 2016, 27(10): 2998−3012 doi: 10.1109/TPDS.2016.2518664
[139] Chi Yuze, Dai Guohao, Wang Yu, et al. NXGraph: An efficient graph processing system on a single machine[C]//Proc of the 2016 IEEE 32nd Int Conf on Data Engineering (ICDE). Piscataway, NJ: IEEE, 2016: 409−420
[140] Xu Xianghao, Wang Fang, Jiang Hong, et al. A hybrid update strategy for I/O-efficient out-of-core graph processing[J]. IEEE Transactions on Parallel and Distributed Systems, 2020, 31(8): 1767−1782 doi: 10.1109/TPDS.2020.2973143
[141] Zhang Mingxing, Wu Yongwei, Zhuo Youwei, et al. Wonderland: A novel abstraction-based out-of-core graph processing system[C]//Proc of the 23rd Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2018: 608–621
[142] Zheng Long, Li Xianliang, Zheng Yaohui, et al. Scaph: Scalable GPU-accelerated graph processing with value-driven differential scheduling[C]//Proc of the 2020 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2020: 573−588
[143] Tang Ruiqi, Zhao Ziyi, Wang Kailun, et al. Ascetic: Enhancing cross-iterations data efficiency in out-of-memory graph processing on GPUs[C]//Proc of the 50th Int Conf on Parallel Processing. New York: ACM, 2021: 1−10
[144] Vora K, Xu Guoqing, Gupta R. Load the edges you need: A generic I/O optimization for disk-based graph processing[C]//Proc of the 2016 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2016: 507–522
[145] Pan Zhenxuan, Wu Tao, Zhao Qingwen, et al. GeaFlow: A graph extended and accelerated dataflow system[J]. Proceedings of the ACM on Management of Data, 2023, 1(2): 1−27
[146] Ediger D, Mccoll R, Riedy J, et al. STINGER: High performance data structure for streaming graphs[C]//Proc of the 2012 IEEE Conf on High Performance Extreme Computing. Piscataway, NJ: IEEE, 2012: 1−5
[147] Macko P, Marathe V J, Margo D W, et al. LLAMA: Efficient graph analytics using large multiversioned arrays[C]//Proc of the 2015 IEEE 31st Int Conf on Data Engineering. Piscataway, NJ: IEEE, 2015: 363−374
[148] De Leo D, Boncz P. Teseo and the analysis of structural dynamic graphs[J]. Proceedings of the VLDB Endowment, 2021, 14(6): 1053−1066 doi: 10.14778/3447689.3447708
[149] Basak A, Lin Jilan, Lorica R, et al. SAGA-Bench: Software and hardware characterization of streaming graph analytics workloads[C]//Proc of the 2020 IEEE Int Symp on Performance Analysis of Systems and Software (ISPASS). Piscataway, NJ: IEEE, 2020: 12−23
[150] Islam A A R, Dai Dong, Cheng Dazhao. VCSR: Mutable CSR graph format using vertex-centric packed memory array[C]//Proc of the 2022 22nd IEEE Int Symp on Cluster, Cloud and Internet Computing (CCGrid). Piscataway, NJ: IEEE, 2022: 71−80
[151] Wang Rui, He Shuibing, Zong Weixu, et al. XPGraph: XPline-friendly persistent memory graph stores for large-scale evolving graphs[C]//Proc of the 2022 55th IEEE/ACM Int Symp on Microarchitecture (MICRO). Piscataway, NJ: IEEE, 2022: 1308−1325
[152] Chen Hongtao, Zhang Mingxing, Yang Ke, et al. Achieving sub-second pairwise query over evolving graphs[C]//Proc of the 28th ACM Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2023: 1−15
[153] Shen Sijie, Yao Zihang, Shi Lin, et al. Bridging the gap between relational OLTP and graph-based OLAP[C]//Proc of the 2023 USENIX Annual Technical Conf (USENIX ATC 23). Berkeley, CA: USENIX Association, 2023: 181−196
[154] Vaziri P, Vora K. Controlling memory footprint of stateful streaming graph processing[C]//Proc of the 2021 USENIX Annual Technical Conf (USENIX ATC 21). Berkeley, CA: USENIX Association, 2021: 269−283
[155] Zhang Chenglong, Cao Huawei, Wang Guobo, et al. Efficient optimization of graph computing on high-throughput computer[J]. Journal of Computer Research and Development, 2020, 57(6): 1152−1163 (in Chinese) doi: 10.7544/issn1000-1239.2020.20200115
[156] Sha Mo, Li Yuchen, Tan K L. GPU-based graph traversal on compressed graphs[C]//Proc of the 2019 Int Conf on Management of Data. New York: ACM, 2019: 775–792
[157] Chen Zheng, Zhang Feng, Guan Jiawei, et al. CompressGraph: Efficient parallel graph analytics with rule-based compression[J]. Proceedings of the ACM on Management of Data, 2023, 1(2): 1−31
[158] Lee E, Kim J, Lim K, et al. Pre-select static caching and neighborhood ordering for BFS-like algorithms on disk-based graph engines[C]//Proc of the 2019 USENIX Annual Technical Conf. Berkeley, CA: USENIX Association, 2019: 459–473
[159] Yang T Y, Liang Yuhong, Yang M C. Practicably boosting the processing performance of BFS-like algorithms on semi-external graph system via I/O-efficient graph ordering[C]//Proc of the 20th USENIX Conf on File and Storage Technologies (FAST 22). Berkeley, CA: USENIX Association, 2022: 381−396
[160] Wei Hao, Yu J X, Lu Can, et al. Speedup graph processing by graph ordering[C]//Proc of the 2016 Int Conf on Management of Data. New York: ACM, 2016: 1813–1828
[161] Balaji V, Lucia B. When is graph reordering an optimization? Studying the effect of lightweight graph reordering across applications and input graphs[C]//Proc of the 2018 IEEE Int Symp on Workload Characterization (IISWC). Piscataway, NJ: IEEE, 2018: 203−214
[162] Guan Nan, Stigge M, Yi Wang, et al. Cache-aware scheduling and analysis for multicores[C]//Proc of the 7th ACM Int Conf on Embedded Software. New York: ACM, 2009: 245–254
[163] Han W S, Lee S, Park K, et al. TurboGraph: A fast parallel graph engine handling billion-scale graphs in a single PC[C]//Proc of the 19th ACM SIGKDD Int Conf on Knowledge Discovery and Data Mining. New York: ACM, 2013: 77–85
[164] Lin Shuai, Wang Rui, Li Yongkun, et al. Towards fast large-scale graph analysis via two-dimensional balanced partitioning[C]//Proc of the 51st Int Conf on Parallel Processing. New York: ACM, 2023: 1−11
[165] Gong Shufeng, Zhang Yanfeng, Yu Ge. HBP: Hotness balanced partition for prioritized iterative graph computations[C]//Proc of the 2020 IEEE 36th Int Conf on Data Engineering (ICDE). Piscataway, NJ: IEEE, 2020: 1942−1945
[166] NVIDIA. Unified memory for CUDA beginners [EB/OL]. [2024−01−20]. https://developer.nvidia.com/blog/unified-memory-cuda-beginners/
[167] NVIDIA. NVLink and NVSwitch [EB/OL]. [2024−01−20]. https://www.nvidia.com/en-us/data-center/nvlink/
[168] Lin Yen, Jeng J Y, Liu Y Y, et al. A review of PCI Express protocol-based systems in response to 5G application demand[J]. Electronics, 2022, 11(5): 678 doi: 10.3390/electronics11050678
[169] Wang Jing, Li Chao, Liu Yibo, et al. Fargraph+: Excavating the parallelism of graph processing workload on RDMA-based far memory system[J]. Journal of Parallel and Distributed Computing, 2023, 177: 144−159 doi: 10.1016/j.jpdc.2023.02.015
[170] Wang Jing, Li Chao, Wang Taolei, et al. Excavating the potential of graph workload on RDMA-based far memory architecture[C]//Proc of the 2022 IEEE Int Parallel and Distributed Processing Symp (IPDPS). Piscataway, NJ: IEEE, 2022: 1029−1039
[171] Cui Pengjie, Yuan Ye, Li Cenhao, et al. RGraph: Effective distributed graph data processing system based on RDMA[J]. Journal of Software, 2022, 33(3): 1018−1042 (in Chinese)
[172] Brock B, Buluç A, Yelick K. RDMA-based algorithms for sparse matrix multiplication on GPUs[J]. arXiv preprint, arXiv: 2311.18141, 2023
[173] Kwak H, Lee C, Park H, et al. What is Twitter, a social network or a news media?[C]//Proc of the 19th Int Conf on World Wide Web. New York: ACM, 2010: 591–600
[174] Boldi P, Santini M, Vigna S. Laboratory for web algorithmics [EB/OL]. [2024−01−20]. http://law.di.unimi.it/
[175] Rossi R, Ahmed N. The network data repository with interactive graph analytics and visualization[C]//Proc of the 29th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2015: 1−2
[176] Yang J, Leskovec J. Defining and evaluating network communities based on ground-truth[C]//Proc of the ACM SIGKDD Workshop on Mining Data Semantics. New York: ACM, 2012: 1−8
[177] Leskovec J, Sosič R. SNAP: A general-purpose network analysis and graph-mining library[J]. ACM Transactions on Intelligent Systems and Technology, 2016, 8(1): 1−20
[178] Boldi P, Vigna S. The WebGraph framework I: Compression techniques[C]//Proc of the 13th Int Conf on World Wide Web. New York: ACM, 2004: 595–602
[179] Boldi P, Rosa M, Santini M, et al. Layered label propagation: A multiresolution coordinate-free ordering for compressing social networks[C]//Proc of the 20th Int Conf on World Wide Web. New York: ACM, 2011: 587–596