Abstract: With the rapid development of large model technology, large models have shown remarkable performance in fields such as natural language processing and computer vision, becoming essential tools for solving complex problems and attracting wide attention from academia and industry. However, current cloud-based schemes for training and serving large models face many challenges, including high cost, limited scalability, and information security risks. As model parameter counts keep growing, the need for low-cost, efficient training and inference becomes ever more pressing. Collaborative training and inference of large models on end and edge devices can markedly reduce latency and bandwidth requirements while strengthening data privacy and operational efficiency, providing key technical support for the low-cost application of large models in diverse scenarios; it has therefore become a research hotspot. This article surveys research on large models for edge intelligence, analyzing and discussing existing work mainly from two perspectives: edge training and edge inference of large models. Finally, it summarizes the challenges facing large model technology for edge intelligence and outlines future prospects, with the hope of deepening the understanding and attention of academia and industry and inspiring further in-depth research.
The von Neumann architecture, in which the storage and computation modules are separated, suffers from the "memory wall" problem [1], which severely limits processor performance gains and incurs considerable energy consumption. Breaking through this bottleneck requires innovation at the architecture level, namely new in-memory computing architectures [2-4]. Memristive stateful logic provides the circuit foundation for in-memory computing. By fusing Boolean logic with non-volatile storage, stateful logic eliminates data movement during computation (and with it the latency and energy of memory accesses), achieving a fine-grained fusion of storage and computation. To date, stateful logic gates such as IMP, FALSE, and NOR have been verified through theoretical derivation (simulation) and measured experiments, with functions covering complete sets of Boolean logic and thus providing a feasible basis for complex logic computation. However, research on the automated design flow that converts a complex computational function into a cascaded sequence of stateful logic gates inside a memristive memory array is still in its infancy, and several challenges remain to be addressed.
The first challenge is the limited variety of gate types. Most existing work on implementing complex computational functions with stateful logic cascades a single functionally complete gate, such as IMP, NOR, or NAND, without exploiting multiple compatible stateful logic gates, which greatly restricts the optimization space of complex stateful logic computation. Incorporating multiple stateful logic gates into the synthesis strategy offers more primitive functions for implementing complex logic and can effectively reduce the final mapping size, the number of operations, and the execution latency. It is therefore necessary to explore synthesis and mapping methods for in-memory computing that use multiple stateful logic gates.
The second challenge is the narrow objective of synthesis and mapping. Most existing work on stateful-logic implementations of complex computation takes computation latency as the sole optimization objective, and other design goals are rarely discussed. A memristive stateful logic gate is in essence a "conditional write" controlled by applied voltages: for the corresponding input data, a successful gate execution is accompanied by an erase/write process (a state toggle). At the current level of process maturity, the write endurance of practical memristor products still falls short of conventional dynamic random access memory (DRAM) and static random access memory (SRAM). During stateful logic computation, repeated writes can wear out devices and cause failures, which deserves particular attention in edge computing scenarios where maintenance is inconvenient. It is therefore necessary to study methods that reduce device wear during stateful logic computation so as to extend the lifetime of edge computing devices and lower maintenance and replacement costs.
To address these two problems, this paper studies multi-gate stateful logic synthesis for low-wear in-memory computing, exploring a synthesis and mapping flow that uses multiple stateful logic gates to reduce the toggle rate of complex in-memory stateful logic computation.
1. Related Work
1.1 Stateful Logic Gates
Following the definition of a gate in CMOS logic, a stateful logic gate denotes a memristor-based circuit together with its corresponding logic function [5]. During the operation of a stateful logic gate, information is carried by the resistance states of the memristors, and the logic operation is the conditional switching of each memristor under voltage stimuli. This logic implementation paradigm, in which resistance represents logic information and logic functions are mapped through resistance-switching processes, is called stateful logic [6].
Typically, the memristors used to build stateful logic gates have two stable resistance states: the high-resistance state (HRS, usually defined as logic 0) and the low-resistance state (LRS, usually defined as logic 1). For a bipolar memristor, switching between the HRS and the LRS is triggered by external operating voltages of opposite polarity, referred to as the SET and RESET voltages [7]. With the HRS and LRS representing the digital signals 0 and 1, applying a specific sequence of voltage signals to a memristor circuit changes the states of the memristors accordingly, so that a logic function is mapped between the initial and final states of the memristors. A stateful logic gate triggers a "conditional write" through control voltages applied by the peripheral circuitry, maps a logic relation between the post-operation output device and the input devices, and thus realizes a Boolean logic function.
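To make the conditional-write behaviour concrete, the following minimal Python sketch models a bipolar memristor as a bit that is SET above one threshold and RESET below another, and a NOR-type gate as a write whose condition is met only when both input cells store 0. The threshold values follow Section 2; the voltage-divider details of the real PMR/PMASM circuits are deliberately abstracted away, so this is an illustration rather than the circuit model.

```python
# Hedged behavioural sketch (not the SPICE-level circuit).
V_SET, V_RESET = 1.4, -1.0

def conditional_write(state: int, v_across: float) -> int:
    """New resistance state (0 = HRS, 1 = LRS) of one cell after one voltage pulse."""
    if v_across >= V_SET:
        return 1          # SET: HRS -> LRS
    if v_across <= V_RESET:
        return 0          # RESET: LRS -> HRS
    return state          # write condition not met: the stored bit is retained

def stateful_nor(p: int, q: int, v_apply: float = 1.6) -> int:
    """The initial states of the input cells decide whether the output cell (preset
    to 0) sees a voltage above V_SET; the divider is abstracted into a condition."""
    v_out = v_apply if (p == 0 and q == 0) else 0.0   # enough drive only when P = Q = 0
    return conditional_write(0, v_out)                # final output = NOR(P, Q)
```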
Existing work has realized a variety of stateful logic gates, either by changing the circuit structure or by varying the magnitudes of the control voltages for the same structure, with logic functions covering complete sets such as IMP, NOR, NAND, and NOT. This paper follows the gate naming convention of Xu et al. [5], denoting a stateful logic gate as "structure-N-nFUNCTION", where "structure" is the type of circuit connection, "N" is the number of logic inputs, "n" is the number of memristive devices involved, and "FUNCTION" is the logic function realized. Fig. 1 shows two typical stateful logic gate structures implementing the NOR function, "PMR-two-3NOR" and "PMASM-two-3NOR".
For convenience, this paper uses the terms simple gate and composite gate to refer to stateful logic gates [8]. A simple gate maps its logic inputs and output to different memristive devices, such as "PMR-two-3NOR" and "PMASM-two-3NOR" in Fig. 1. In such gates, the output device is initialized to a constant 0 or 1, executing the gate is a "conditional write" of the output device, and after the operation the logic output is produced and stored in the output device. A composite gate is a stateful logic gate whose output shares a device with one of its logic inputs, and it can be obtained by extending a simple gate: if the output device of a simple gate is not preset to a constant 0 or 1 but instead serves as a third logic input, the gate's function becomes the OR (when the simple gate is built on a conditional SET of the output device) or the AND (when it is built on a conditional RESET) of the original simple-gate function with the third input. For example, if the output device of the "PMR-two-3NOR" gate in Fig. 1(a) is used as a third logic input, the realized function becomes ONOR, i.e., ¬(P+Q)+Y, and we name this composite gate "PMR-three-3ONOR"; similarly, the "PMASM-two-3NOR" gate in Fig. 1(b) extends to the composite gate "PMASM-three-3ANOR", i.e., ¬(P+Q)·Y. Thus every simple stateful logic gate corresponds to a composite gate extended from it, whose function is the original simple-gate function cascaded with an OR or an AND. A composite gate's function can therefore be decomposed into "simple-gate function + OR (or AND)", and conversely "simple-gate function + OR (or AND)" can be merged into a composite-gate function. This interconversion between simple and composite gates is the theoretical basis of the netlist post-processing described later in this paper.
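The simple-to-composite expansion can be checked exhaustively at the logic level. The sketch below (our illustration, not code from the paper) models the composite gate operationally as the same conditional SET applied to an output cell that initially holds the third operand Y, and verifies that this equals NOR(P,Q) OR Y.

```python
from itertools import product

def nor(p, q):
    """Simple gate PMR-two-3NOR: output cell preset to 0, conditionally SET."""
    return 1 - (p | q)

def onor_operational(p, q, y):
    """Composite gate PMR-three-3ONOR: the output cell is *not* cleared; it holds the
    third operand Y, and the same conditional-SET condition as the simple gate applies."""
    out = y
    if p == 0 and q == 0:     # the NOR condition of the underlying simple gate
        out = 1               # conditional SET
    return out

# Exhaustive check of the claimed expansion: ONOR(P, Q, Y) = NOR(P, Q) OR Y
for p, q, y in product((0, 1), repeat=3):
    assert onor_operational(p, q, y) == (nor(p, q) | y)

# The dual construction (conditional RESET on a preset operand) gives the AND variant,
# e.g. PMASM-three-3ANOR(P, Q, Y) = NOR(P, Q) AND Y.
```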
1.2 Cascaded Implementation of Complex Logic
A single-step logic operation with one stateful logic gate is not sufficient for complex computation; executing a complex computation requires flexibly cascading stateful logic gates, that is, feeding the output of one gate into the input of the next. Unlike the cascading of CMOS logic gates, the logic information of stateful logic gates is represented by resistance states and cannot be transmitted over physical metal wires. Thanks to the flexibility of the memristive crossbar array, however, stateful logic gates can be cascaded by overlapping the devices they use within the crossbar.
A prerequisite for cascading stateful logic gates in a memristive crossbar array is that the gates can be flexibly configured within the array. Our preliminary survey shows that cascadable stateful logic gates mainly come in three circuit structures: PMR [9], PMASM [10], and APMR [11]. The first two are suited to configuration in 2D memristive crossbar arrays, as shown in Fig. 2, while the third is suited to configuration between the inter-layer devices of a 3D crossbar array.
For stateful logic gates that can be flexibly configured in a memristive array, complex logic computation can be realized through cascading in both space and time [12]. Fig. 3 shows the steps of cascading two "PMR-two-3NOR" gates in a 2D crossbar array. Cascading stateful logic gates requires both coordinating the memristor cells in the array (the spatial dimension) and triggering the gates in sequence (the temporal dimension). Completing a complex stateful logic computation in a crossbar array is therefore the process of configuring stateful logic gates into the array one after another, by applying a sequence of operating voltages, so that their functions are cascaded.
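The following toy sketch illustrates this time-space cascading at the logic level only: the cells of one row are indexed, gates fire one per time step, and the output cell of one step is reused as an input of a later step. The cell indices and values are made up for illustration; the voltage sequencing of Fig. 3 is not modelled.

```python
def nor(p, q):
    return 1 - (p | q)

def run_cascade(row, steps):
    """row: cell states on a single crossbar row; steps: gates in time order, each
    given as (input_cell_1, input_cell_2, output_cell) with the output preset to 0."""
    for t, (a, b, out) in enumerate(steps, start=1):
        row[out] = nor(row[a], row[b])          # one conditional-write operation per step
        print(f"t{t}: cell{out} <- NOR(cell{a}, cell{b}) = {row[out]}")
    return row

# Two PMR-two-3NOR gates cascaded: the output cell of step 1 (cell 2) is reused as an
# input of step 2 (spatial overlap), and the two steps fire in order (temporal cascade).
row = [1, 0, 0, 1, 0]          # cells 0, 1, 3 hold inputs; cells 2 and 4 are preset to 0
run_cascade(row, [(0, 1, 2), (2, 3, 4)])
```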
Although an in-memory stateful logic computing system eliminates data loading and storing, its space-time cascading nature makes it difficult for stateful logic to outperform CMOS combinational circuits in the computation process itself [5,11]. Studies have shown that executing many computation instances in parallel within a crossbar array can compensate for this weakness [13-15]. Even so, implementing a complex computational function with fewer stateful logic steps is still worth in-depth study, because it directly determines the efficiency of the in-memory stateful logic computing system. For a given complex computation, finding the optimal cascaded sequence of stateful logic gates is exactly the stateful logic synthesis and mapping problem.
In the early stage of stateful logic research, most work implemented complex computation instances by manually designing gate cascades. For example, Talati et al. [10] implemented a 1-bit full adder in 12 cascaded steps using the "PMASM-two-3NOR" gate; Adam et al. [16] implemented a 1-bit full adder in a 3D memristive array by cascading "PMR-two-2IMP" and "APMR-two-2IMP" gates, requiring 35 steps because of the many read and write operations involved; Huang et al. [9] implemented a 1-bit full adder in 10 steps using the "PMR-two-3NAND" gate; Xu et al. [11] completed a 1-bit full adder in 14 steps by cascading six APMR gates; and Sun et al. [17] reduced the 1-bit full adder to 2 steps by using multi-input composite stateful logic gates. Such manual design works for fixed, small-scale circuit functions, but for complex, large-scale functions it is time-consuming and error-prone. Automated logic synthesis tools are therefore needed to realize complex computation.
Researchers have developed a number of stateful logic synthesis tools that can find gate execution sequences implementing complex logic functions in a crossbar array at low time cost (or for other objectives) [8,15,18-26]. Research on automated stateful logic synthesis tools has gone through two main stages.
The first stage focused on decomposing complex computational functions into stateful logic gate functions, without much consideration of array mapping constraints [27-30]. For example, Chakraborti et al. [27] proposed an efficient memristor-based implementation of a 2-to-1 multiplexer and a synthesis method that represents a given Boolean function as a reduced ordered binary decision diagram (ROBDD); Chattopadhyay et al. [28] extended conventional synthesis algorithms with new heuristics. In addition, Bürger et al. [18] proposed using CMOS-oriented synthesis tools (such as ABC) to decompose complex logic functions into the primitive functions of stateful logic gates.
The second stage uses CMOS synthesis tools to complete logic synthesis automatically and then, subject to array constraints, maps the stateful logic gates to positions in the array. Work in this stage falls roughly into two categories. The first targets mapping over the full array, mostly optimizing for the minimum number of stateful logic gates and maximum gate-level parallelism, with a few studies also discussing area (device count) constraints. For example, Hur et al. [19] proposed a general synthesis and mapping flow (SIMPLE MAGIC) that uses the "PMASM-two-3NOR" and "PMASM-one-2NOT" gates, optimizes gate execution over the full array, and takes the relevant array constraints into account. The CONTRA flow proposed by Bhattacharjee et al. [21] maps input functions based on look-up tables (LUTs) onto the crossbar to maximize parallel operation and uses a new search technique to move data within the array in an optimal way. However, full-array synthesis and mapping methods mostly rely on solvers that traverse the solution space, which is time-consuming; to reduce solving time, a second category of methods maps to a single row or column. The representative work is the improved automated synthesis and mapping method of Hur et al. [15], SIMPLER MAGIC, whose optimization objective shifts from the previous minimum latency (number of operation steps) to minimum area (number of devices used), reusing cells when needed to save area [5].
As can be seen, most existing stateful logic synthesis and mapping work optimizes for computation latency; a few studies discuss area (device count) constraints, and other objectives, in particular device wear, are rarely investigated. This paper therefore explores a new stateful logic synthesis and mapping method that targets reduced device wear during stateful logic computation, so as to extend the lifetime of edge computing devices and lower maintenance and replacement costs.
2. Compatibility Verification of Stateful Logic Gates
As described in Section 1.2, automating the design of complex stateful logic computation requires two ingredients: functionally complete stateful logic gates that can be configured in the same memristive crossbar array, and an automated synthesis and mapping method. Before studying synthesis and mapping, we therefore first verify the function and compatibility of the multiple stateful logic gates to be used, to guarantee logical correctness and array configurability. Using SPICE circuit simulation, this paper verifies the function of six PMR-structure stateful logic gates: four simple gates (COPY, NOT, NOR, OR) and two composite gates (IMP, ONOR). The verification uses Stanford University's open-source ReRAM device model (metal oxide resistive random access memory Verilog-A models, Version 1.0.0) [31], with the device parameters listed in Table 1.
Table 1. Device Parameters
Parameter | Description | Default value
T_ini /K | temperature | 298
F_min /(V/m) | minimum field strength promoting tunneling-gap formation | 1.4E9
Tox /nm | oxide thickness | 12
gap_ini /nm | initial tunneling gap | 1.8
gap_min /nm | minimum tunneling gap | 0.2
gap_max /nm | maximum tunneling gap | 1.8
In the adopted memristor model, the complex processes of ion and vacancy migration are simplified into the growth/dissolution of a one-dimensional conductive filament while the essential switching physics is retained. The tunneling gap distance, i.e., the distance between the filament tip and the top electrode, is the main variable determining the device resistance [31]. In the functional verification, the initial state of a device is therefore set to the high-resistance state (HRS) or the low-resistance state (LRS) through the gap distance: the resistance state at a gap of 1.7 nm is taken as the HRS, and that at 0.3 nm as the LRS. Through trial simulations, we take a SET voltage of 1.4 V and a RESET voltage of −1.0 V as the solved parameters for the operating conditions of the stateful logic gates.
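The small helper below mirrors how logic values were encoded in these simulations: states are set and read back through the gap distance. The 1.0 nm decision threshold is our assumption for illustration; the Verilog-A model itself and the read circuitry are not reproduced.

```python
# Hedged sketch of the state encoding used in the simulations (not the device model).
GAP_HRS_NM, GAP_LRS_NM = 1.7, 0.3     # gap distances chosen for logic 0 / logic 1
V_SET, V_RESET = 1.4, -1.0            # operating voltages found by trial simulation

def encode_logic(bit: int) -> float:
    """Gap distance used to initialize a device to logic 0 (HRS) or logic 1 (LRS)."""
    return GAP_LRS_NM if bit else GAP_HRS_NM

def decode_logic(gap_nm: float, threshold_nm: float = 1.0) -> int:
    """Interpret a simulated gap distance as a Boolean state (assumed midpoint threshold)."""
    return 1 if gap_nm < threshold_nm else 0
```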
According to the naming convention of Section 1.1, the six stateful logic gates used in this paper fall into four classes: the first is "PMR-one-2x", comprising "PMR-one-2NOT" and "PMR-one-2COPY"; the second is "PMR-two-2x", comprising "PMR-two-2IMP"; the third is "PMR-two-3x", comprising "PMR-two-3NOR" and "PMR-two-3OR"; and the fourth is "PMR-three-3x", comprising "PMR-three-3ONOR". The gates of the first and second classes share the circuit structure shown in Fig. 4(a), and "PMR-two-2IMP" is the composite gate extended from "PMR-one-2NOT"; the gates of the third and fourth classes share the circuit structure shown in Fig. 4(b), and "PMR-three-3ONOR" is the composite gate extended from "PMR-two-3NOR". The simulation-based verification of these four classes of gates is described in the following two subsections.
2.1 Verification of the "PMR-one/two-2x" Stateful Logic Gates
The circuits of "PMR-one-2NOT", "PMR-one-2COPY", and "PMR-two-2IMP" consist of two parallel-connected memristors, M1 and M2, and one series resistor RS (50 Ω). In the simulations, specific operating voltages are applied at the Vin and Vout terminals according to the gate's state transitions so that the memristors receive the corresponding voltage division, thereby realizing the different logic functions. The simulation results are shown in Fig. 5; in each subfigure, the topmost plot shows the applied voltage stimuli and the remaining plots show the gap-distance curves under the various logic-state transitions.
2.2 Verification of the "PMR-two/three-3x" Stateful Logic Gates
The circuits of "PMR-two-3NOR", "PMR-two-3OR", and "PMR-three-3ONOR" consist of three parallel-connected memristors, M1, M2, and M3, and one series resistor RS (50 Ω); the simulation results are shown in Fig. 6.
Since all gate simulations are based on the memristor model with the same parameters, and all the circuit structures are compatible with the memristive crossbar array, these six stateful logic gates can be considered executable in a crossbar array built from memristors with these parameters. The next section presents the low-wear synthesis and mapping method that uses these six gates to perform complex stateful logic computation in a memristive crossbar array.
3. Low-Wear Synthesis and Mapping
This section presents the multi-gate stateful logic synthesis and mapping method for low-wear in-memory computing. The method uses a synthesis and mapping flow that incorporates multiple stateful logic gates to reduce the toggle rate of complex in-memory stateful logic computation; the flow is shown in Fig. 7.
First, a commercial logic synthesis tool synthesizes the complex logic function into a netlist composed of the "PMR-one-2NOT", "PMR-two-3NOR", and "PMR-two-3OR" functions, with the minimum total toggle rate of the gates as the optimization objective. Next, the netlist is post-processed: according to the merge rules, mergeable simple-gate functions are combined into composite-gate functions, further introducing the "PMR-two-2IMP", "PMR-three-3ONOR", and "PMR-one-2COPY" (used to resolve cyclic dependencies [8]) functions; the merging likewise uses toggle-rate reduction as its criterion. Finally, the post-processed netlist functions are mapped one-to-one to the corresponding stateful logic gates, and the gates are configured in execution order onto a single row of the memristive crossbar array, yielding the cascading order and positions of the gates, from which the toggle rate is computed.
3.1 Computing the Toggle Rate of Stateful Logic Gates
The toggle rate of a stateful logic gate is the average probability that its logic state switches. Take "PMR-two-3NOR" as an example: the output memristor M3 is initialized to logic 0 (HRS), and after the logic operation its state changes in only one of the four input cases, as shown by the truth table in Fig. 1. The average probability of a state switch when "PMR-two-3NOR" operates is therefore 0.25. Likewise, the toggle rate of "PMR-three-3ONOR" is 0.125. Table 2 lists the toggle rates of all the stateful logic gates; the toggle rate serves as a measure of gate wear.
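The values in Table 2 can be reproduced by enumerating all input patterns, assumed equally likely, and counting how often the final output differs from the preset state of the output cell. A minimal sketch follows; the preset conventions (0 for conditional-SET simple gates, the shared operand for composite gates, and 1 for the OR gate, which its listed toggle rate of 0.25 implies) are our reading of the descriptions above.

```python
from itertools import product

def toggle_rate(gate_fn, n_inputs, out_init):
    """Average probability that the output cell flips, over all equally likely input
    patterns.  out_init is the preset value of the output cell: a constant 0/1 for a
    simple gate, or a function of the inputs (the shared operand) for a composite gate."""
    patterns = list(product((0, 1), repeat=n_inputs))
    flips = 0
    for bits in patterns:
        init = out_init(bits) if callable(out_init) else out_init
        flips += int(gate_fn(bits) != init)
    return flips / len(patterns)

nor_f  = lambda b: 1 - (b[0] | b[1])
or_f   = lambda b: b[0] | b[1]
onor_f = lambda b: (1 - (b[0] | b[1])) | b[2]
imp_f  = lambda b: (1 - b[0]) | b[1]

print(toggle_rate(nor_f, 2, 0))                # PMR-two-3NOR, preset 0     -> 0.25
print(toggle_rate(or_f, 2, 1))                 # PMR-two-3OR, preset 1      -> 0.25
print(toggle_rate(onor_f, 3, lambda b: b[2]))  # PMR-three-3ONOR, preset Y  -> 0.125
print(toggle_rate(imp_f, 2, lambda b: b[1]))   # PMR-two-2IMP, preset Q     -> 0.25
```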
Table 2. Toggle Rates of the Six Stateful Logic Gates Used in This Paper
Stateful logic gate | Toggle rate
COPY | 0.5
NOT | 0.5
NOR | 0.25
OR | 0.25
IMP | 0.25
ONOR | 0.125
3.2 Logic Synthesis
The complex logic synthesis step uses a commercial CMOS logic synthesis tool to decompose the complex logic function into stateful-logic gate functions, as follows:
First, a cell library (a .lib file) is defined according to the stateful-logic gate functions used: the NOT, NOR, and OR gates are taken from the standard cell library to form a new custom library. Then, the area parameter of each defined gate is replaced by the toggle rate of the corresponding stateful logic gate. Finally, minimum area is set as the synthesis objective, and synthesis produces a low-toggle-rate netlist composed of the three simple-gate functions.
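As an illustration of this cost encoding (the Liberty file itself, with its pin, function, and timing attributes, is not reproduced here), the numbers written into the area fields and the resulting netlist cost can be sketched as follows.

```python
# Hedged sketch: only NOT, NOR and OR are exposed to the commercial tool, and each
# cell's "area" attribute in the custom .lib is overwritten with the gate's toggle
# rate, so the tool's minimum-area objective becomes a minimum-toggle objective.
AREA_AS_TOGGLE = {"NOT": 0.5, "NOR": 0.25, "OR": 0.25}

def netlist_toggle_cost(gate_counts):
    """Total toggle rate of a synthesized netlist, given per-type gate counts."""
    return sum(AREA_AS_TOGGLE[g] * n for g, n in gate_counts.items())

# e.g. a (hypothetical) netlist with 10 NOR, 4 OR and 3 NOT gates costs 5.0 toggles
print(netlist_toggle_cost({"NOR": 10, "OR": 4, "NOT": 3}))
```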
3.3 Post-Processing
The simple-gate netlist obtained in the previous step may contain the functional groups {NOT,OR}, {NOR,OR}, {NOT,NOR}, {NOR,NOR}, and {NOT,NOT}. Based on the correspondence between simple and composite gates described in Section 1.1 and on logic-equivalence transformations, the netlist can be transformed in a post-processing step.
Note that, to avoid errors caused by overwriting inputs, two rules must be followed when merging simple gates [8]:
1) if the input overwritten by the merged composite gate is also an input of another stateful logic gate, the two simple gates cannot be merged;
2) if an input of the second simple gate is the overwritten input of another composite gate, the two simple gates cannot be merged.
Subject to these rules, the possible merges are listed in Table 3.
Table 3. Merges of Stateful Logic Gates
Case | Functional group | After merging
1 | {NOR,OR} | ONOR
2 | {NOT,OR} | IMP
3 | {NOT,NOR} | IMP, NOT
4 | {NOR,NOR} | ONOR, NOT
5 | {NOT,NOT} | NOT
Comparing the toggle rates before and after merging shows that the merge of case 3 leaves the toggle rate unchanged (0.5 + 0.25 = 0.25 + 0.5), while the merge of case 4 increases it (0.25 + 0.25 < 0.125 + 0.5). In a synthesis and mapping method that targets low wear, cases 3 and 4 therefore need additional handling.
To reduce the toggle rate further, the merge of case 3 should be abandoned; the NOR and NOT gates retained in this way can then take part in other merges that do lower the toggle rate.
For case 4, simply abandoning the merge as in case 3 does not yield the expected benefit, because once the second NOR of case 4 is decomposed into an OR and a NOT gate, two NOT gates become adjacent, which matches case 5. Considering cases 4 and 5 together reduces the toggle rate further; the new merging process is:
NOR + NOR + NOT => NOR + OR + NOT + NOT => ONOR + NOT.
In summary, the merge rules applied to stateful logic gates in the post-processing stage are:
1) NOR(0.25) + OR(0.25) => ONOR(0.125);
2) NOT(0.5) + OR(0.25) => IMP(0.25);
3) NOR(0.25) + NOR(0.25) + NOT(0.5) => NOR(0.25) + OR(0.25) + NOT(0.5) + NOT(0.5) => ONOR(0.125) + NOT(0.5);
4) NOT(0.5) + NOT(0.5) => NOT(0.5).
In the first three transformations, the logic functions on the two sides are equivalent, and the inputs of the first simple gate can point directly to the composite gate. In the fourth transformation, the input of the first NOT gate is already the correct output and can simply be connected to the other stateful logic gates. A sketch of this merging pass is given below.
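The sketch implements only the pairwise rules 1) and 2) together with the two input-coverage checks and the requirement that a merge must strictly lower the toggle rate; rules 3) and 4), which also rewire fan-in, and the COPY insertion for cyclic dependencies [8] are omitted. The netlist representation is a simplified stand-in for the actual tool's data structures.

```python
TOGGLE = {"NOT": 0.5, "COPY": 0.5, "NOR": 0.25, "OR": 0.25, "IMP": 0.25, "ONOR": 0.125}
PAIR_MERGES = {("NOR", "OR"): "ONOR",     # rule 1): 0.25 + 0.25 -> 0.125
               ("NOT", "OR"): "IMP"}      # rule 2): 0.50 + 0.25 -> 0.25

def try_merge(first, second, fanout, covered_inputs):
    """first/second: (gate_type, inputs, output), where first's output is one input of
    second.  Returns the composite gate type if the merge is legal and lowers toggles."""
    merged = PAIR_MERGES.get((first[0], second[0]))
    if merged is None:
        return None
    # the operand kept in the output cell (the other input of `second`) gets overwritten
    covered = next(i for i in second[1] if i != first[2])
    if fanout.get(covered, 0) > 1:                   # coverage rule 1): it feeds other gates
        return None
    if any(i in covered_inputs for i in second[1]):  # coverage rule 2): reads a covered input
        return None
    if TOGGLE[merged] >= TOGGLE[first[0]] + TOGGLE[second[0]]:
        return None                                  # accept only toggle-reducing merges
    return merged

# e.g. NOR(a, b) = n1 feeding OR(n1, c) = n2 merges into one ONOR gate overwriting c
fanout = {"a": 1, "b": 1, "n1": 1, "c": 1}
print(try_merge(("NOR", ("a", "b"), "n1"),
                ("OR", ("n1", "c"), "n2"), fanout, set()))       # -> "ONOR"
```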
3.4 Mapping
After post-processing, a new netlist containing composite-gate functions is obtained, and the stateful logic gates must then be mapped to the memristive crossbar array according to the netlist's cascading relations. Following the mapping method of LOSSS [8], this paper targets single instruction multiple data (SIMD) computing scenarios and adopts a row/column-oriented mapping mode, which allows multiple instances of the complex logic to execute simultaneously, each compressed into one row of the crossbar array. The existing SIMPLER MAGIC mapping tool [15] is modified to support mapping with multiple stateful logic gates.
First, the post-processed netlist file is read in; the logic functions used are identified and matched to the corresponding stateful logic gates, and the logic structure and node information are extracted.
Second, the execution order of the stateful logic gates is determined. The order depends on the cell usage (CU) value of the node represented by each gate; in the SIMPLER MAGIC algorithm, this value is an estimate of the number of memory cells (nodes serving as inputs) required to execute the gate [15]. Gates with larger CU values should execute first, so that the memristor cells occupied by their input nodes can be released as early as possible and reassigned to new nodes. In addition, to guarantee logical correctness, the node represented by a composite gate must be mapped and executed last among all of its sibling nodes.
Finally, memory cells are allocated to each node according to the configured row size, yielding the latency of the whole logic execution and the number of reused cells. Each node is in one of three states: 1) available; 2) used; 3) uninitialized. A node in state 3) becomes state 1) after initialization; a node in state 1) can be allocated and then enters state 2); and when a node in state 2) no longer takes part in subsequent execution, it can be released and reused. When collecting statistics on the mapped result, the average toggle rate of a reused cell is counted as 0.5 and must be included in the total toggle rate. A sketch of this allocation is given below.
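In the simplified sketch below, gates are assumed to arrive already in CU-determined execution order, the available/used/uninitialized cell states are modelled implicitly, and reuse is counted when a released cell is allocated again. The function and its bookkeeping are illustrative stand-ins, not the modified SIMPLER MAGIC code.

```python
def map_to_row(ops, row_size, primary_inputs):
    """ops: ordered gates, each as (output_node, input_nodes, last_step_output_is_read).
    Returns the cell of every node, the number of steps, and the reuse toggle cost."""
    position = {n: i for i, n in enumerate(primary_inputs)}       # inputs pre-placed
    available = [c for c in range(row_size) if c not in position.values()]
    release_at, released, reused = {}, set(), 0
    for step, (out, _ins, last_use) in enumerate(ops):
        # cells whose node is no longer read by any later gate go back to "available"
        for node in [n for n, w in release_at.items() if w < step]:
            available.append(position[node]); released.add(position[node]); del release_at[node]
        if not available:
            raise RuntimeError("row size too small for this gate sequence")
        cell = available.pop()                    # "available" -> "used"
        reused += cell in released                # re-allocating a freed cell counts as reuse
        position[out] = cell
        release_at[out] = last_use
    return position, len(ops), 0.5 * reused       # reused cells counted at 0.5 toggles each

# e.g. two cascaded gates on a row of width 6, with inputs a, b, c pre-placed
ops = [("n1", ("a", "b"), 1),     # n1 is last read at step 1
       ("n2", ("n1", "c"), 1)]
print(map_to_row(ops, 6, ["a", "b", "c"]))
```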
4. Results and Discussion
To evaluate the optimization, we apply the proposed low-wear synthesis and mapping method to the EPFL [32] and LGSynth91 [33] benchmark circuit suites. LGSynth91 is a benchmark collection widely used in integrated circuit (IC) design and testing that contains a variety of standard circuits for evaluating design-optimization algorithms; compared with LGSynth91, the EPFL suite has larger circuits and places higher demands on logic-optimization tools. We select 10 test circuits from each of EPFL and LGSynth91, run synthesis and mapping, record the latency and toggle rate of the final stateful logic gate sequence, and compare the results with those obtained by two representative stateful logic synthesis and mapping tools, SIMPLER MAGIC [15] and LOSSS [8].
4.1 Experimental Setup
For a fair comparison, the CMOS logic synthesis stage of all three flows uses the same commercial CMOS synthesis tool. The custom cell libraries of our method and of LOSSS contain the OR, NOR, and NOT gates, whereas that of SIMPLER MAGIC contains only NOR and NOT. In our low-wear flow, the area parameter of each gate in the custom library is set to its toggle rate, while in LOSSS and SIMPLER MAGIC all gates are given the same area value. Apart from the custom cell libraries, the synthesis environment and constraints are kept identical to the original flows.
When comparing the three flows, the row size used to map each test case is set to the maximum, over the three flows, of the minimum row size each flow requires for synthesis and mapping. Table 4 lists the minimum row sizes of each benchmark under the three flows; taking the maximum of the three for each benchmark gives the row size finally used.
Table 4. Selection of Row Size
EPFL
Circuit | Ours | LOSSS | SIMPLER MAGIC | Selected row size
adder | 510 | 463 | 390 | 510
arbiter | 2189 | 2147 | 1719 | 2189
bar | 636 | 636 | 399 | 636
cavlc | 168 | 169 | 124 | 169
ctrl | 54 | 56 | 45 | 56
dec | 371 | 371 | 267 | 371
int2float | 50 | 59 | 41 | 59
max | 870 | 854 | 783 | 870
priority | 250 | 191 | 194 | 250
voter | 1235 | 1110 | 1354 | 1354
LGSynth91
Circuit | Ours | LOSSS | SIMPLER MAGIC | Selected row size
alu2 | 74 | 80 | 78 | 80
cm138a | 30 | 30 | 17 | 30
cm42a | 22 | 25 | 16 | 25
cmb | 38 | 36 | 27 | 38
cht | 92 | 94 | 88 | 94
term1 | 88 | 70 | 70 | 88
f51m | 37 | 42 | 32 | 42
mux | 31 | 31 | 31 | 31
ttt2 | 64 | 67 | 57 | 67
z4ml | 18 | 16 | 20 | 20
4.2 Experimental Results
For the three flows, the execution latency and toggle rate of each test case are obtained. For a more intuitive comparison, the results of the SIMPLER MAGIC flow are taken as the baseline, and the improvement ratios (i.e., the percentage reduction) of our flow and of LOSSS relative to SIMPLER MAGIC in execution latency and toggle rate are computed, as shown in Figs. 8-11.
The data in Figs. 8-11 show that, in execution latency, our flow achieves reductions over SIMPLER MAGIC of at most 45.35% and at least 15.94% on the EPFL suite, 24.18% on average, and of at most 51.35% and at least 21.74% on the LGSynth91 suite, 34.67% on average. Our results differ little from those of LOSSS, with an average gap of less than 1.20% across the two suites. Although the proposed flow optimizes for the total toggle rate of the devices over the whole computation, it also improves computation latency compared with the previously reported LOSSS tool. A likely explanation is that the commercial CMOS synthesis tool optimizes heuristically, with delay optimization along the CMOS critical path as its default, highest-priority objective; setting the gate areas to the stateful-logic toggle rates in the CMOS-based synthesis step may yield a better starting netlist, so that the total latency of the stateful logic gate sequence after post-processing and mapping is also reduced.
In toggle rate, compared with SIMPLER MAGIC, our flow achieves reductions of at most 61.82% and at least 21.94% on the EPFL suite, 35.55% on average, and of at most 65.52% and at least 30.88% on the LGSynth91 suite, 47.26% on average. It also reduces the toggle rate relative to the LOSSS flow by 8.48% and 6.72% on average on the two suites, respectively, consistent with the purpose of our low-wear tool. The one exception is the mux circuit, on which our total toggle rate is higher than that of LOSSS; this is because the total toggle rate includes the cell-reuse contribution, and if only the result at the end of post-processing is considered, our flow's toggle rate is still lower than LOSSS's.
In summary, compared with SIMPLER MAGIC and LOSSS, the proposed synthesis flow improves both the toggle rate and the execution latency.
5. Conclusion
In this work, we first verified the compatibility of multiple stateful logic gates with the same memristive memory array. We then studied, with toggle-rate optimality as the guiding constraint, a stateful logic synthesis and mapping method for low-wear computation within memristive memory arrays, and established a synthesis and mapping flow for complex logic computation that incorporates multiple stateful logic gates. Given any computational function, the flow produces a low-wear cascaded sequence of stateful logic gates together with their positions, which is of clear theoretical significance. Future work may add more stateful logic gates or choose better gate combinations, and may jointly consider multiple optimization objectives to reach a better trade-off among array size, processing time, and device lifetime.
Author contributions: Zhao Anning and Xu Nuo are co-first authors. Xu Nuo proposed the overall framework and the algorithmic idea of the paper; Zhao Anning refined the idea and the algorithm details and carried out the experiments and result analysis; Xu Nuo and Zhao Anning wrote the main body of the paper; Liu Kang and Luo Li took part in the discussion of the idea and the approach; all authors participated in the discussion and revision of the paper.
Table 1 Comparison of Different Fine-Tuning Methods
Table 2 Summary of Federated Efficient Fine-tuning Framework for Large Models
Table 3 Classification of Related Work for Large Model Compression
Category | Subcategory | Characteristics | References
Parameter pruning | Structured pruning | Removes redundant structures, reducing model size and computational complexity | [45-48]
Parameter pruning | Unstructured pruning | Sparsifies weights, reducing memory footprint and computation; relies on specific software/hardware to accelerate tensor operations | [49-51]
Knowledge distillation | White-box distillation | Produces domain-specific small models, reducing model size and computation while preserving task performance | [52-60]
Knowledge distillation | Black-box distillation | Performs distillation without access to the large model's internals, producing domain-specific small models | [61-70]
Model quantization | Post-training quantization | Reduces model storage, memory, bandwidth, and computation while preserving accuracy | [71-81]
Model quantization | Quantization-aware training | Reduces quantization error, further preserving accuracy while lowering storage, memory, bandwidth, and computation | [82-86]
Low-rank decomposition | — | Reduces parameter count and accelerates inference | [87-91]
"—" indicates no finer subdivision.
Table 4 Classification of Related Work for Large Model Inference Acceleration Technology
Table 5 Summary of Edge Deployment Frameworks for Large Models
Applicability | Framework | Characteristics | Quantization / Multi-model support / Cross-platform support
General-purpose | TFLite [146] | Runs models on mobile, embedded, and IoT devices; supports multiple development languages and hardware acceleration | √ √ √
General-purpose | TorchExec [147] | Edge deployment tool for the PyTorch platform; compatible with multiple compute platforms and has a lightweight runtime | √ √ √
General-purpose | MNN [148] | Lightweight deep neural network engine with broad compatibility across model formats, operators, devices, and operating systems | √ √ √
General-purpose | NCNN [149] | Neural network inference framework for mobile, with no third-party dependencies | √ √ √
Specialized | MLC-LLM [150] | Accelerates inference with machine-learning compilation | √ √ √
Specialized | llama.cpp [151] | LLM inference in C/C++ | √ √ √
Specialized | llama2.c [152] | Runs Llama inference in a pure C environment | √
Specialized | Mllm [153] | Multimodal inference engine for mobile and edge devices | √ √ √
Specialized | Intel Extension for Transformers [154] | Efficient LLM inference on Intel platforms | √ √
Specialized | InferLLM [155] | Lightweight LLM inference framework deployable to mobile devices | √ √
Specialized | TinyChatEngine [156] | Supports multiple quantization methods on a variety of devices | √ √ √
Specialized | NanoLLM [157] | Lightweight LLM inference engine designed for NVIDIA Jetson | √
[1] OpenAI. ChatGPT: Optimizing language models for dialogue[EB/OL]. (2022-12-30)[2024-02-10]. https://openai.com/blog/chatgpt/#rf2
[2] Achiam J, Adler S, Agarwal S, et al. GPT−4 technical report[J]. arXiv preprint, arXiv: 2303.08774, 2023
[3] Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023
[4] Liu Haotian, Li Chunyuan, Wu Qingyang, et al. Visual instruction tuning[J]. arXiv preprint, arXiv: 2304.08485, 2023
[5] Kirillov A, Mintun E, Ravi N, et al. Segment anything[J]. arXiv preprint, arXiv: 2304.02643, 2023
[6] Touvron H, Martin L, Stone K, et al. Llama 2: Open foundation and fine-tuned chat models[J]. arXiv preprint, arXiv: 2307.09288, 2023
[7] 王睿,齐建鹏,陈亮,等. 面向边缘智能的协同推理综述[J]. 计算机研究与发展,2021,60(2):398−414 Wang Rui, Qi Jianpeng, Chen Liang, et al. Survey of collaborative inference for edge intelligence[J]. Journal of Computer Research and Development, 2021, 60(2): 398−414 (in Chinese)
[8] Alizadeh K, Mirzadeh I, Belenko D, et al. LLM in a flash: Efficient large language model inference with limited memory[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 12562–12584
[9] Mcmahan H B, Moore E, Ramage D, et al. Communication-efficient learning of deep networks from decentralized data[C]//Proc of the 20th Int Conf on Artificial Intelligence and Statistics PMLR. New York: ACM, 2017: 1273−1282
[10] Custers B, Sears A M, Dechesne F, et al. EU Personal Data Protection in Policy and Practice[M]. The Hague, The Netherlands: TMC Asser Press, 2019
[11] Lambda. OpenAI’s GPT−3 language model: A technical overview[EB/OL]. (2020-06-03)[2024-01-08]. https://lambdalabs.com/blog/demystifying-gpt-3#1
[12] Ananthaswamy A. In AI, is bigger always better?[J]. Nature, 2023, 615(7951): 202−205 doi: 10.1038/d41586-023-00641-w
[13] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[C]//Proc of the 33rd Int Conf on Neural Information Processing Systems. New York: ACM, 2020: 1877−1901
[14] Lv Kai, Yang Yuqing, Liu Tengxiao, et al. Full parameter fine-tuning for large language models with limited resources[J]. arXiv preprint, arXiv: 2306.09782, 2023
[15] Lv Kai, Yan Hang, Guo Qipeng, et al. AdaLomo: Low-memory optimization with adaptive learning rate[J]. arXiv preprint, arXiv: 2310.10195, 2023
[16] Malladi S, Gao Tianyu, Nichani E, et al. Fine-tuning language models with just forward passes[J]. arXiv preprint, arXiv: 2305.17333, 2023
[17] Ding Ning, Qin Yujia, Yang Guang, et al. Parameter-efficient fine-tuning of large-scale pre-trained language models[J]. Nature Machine Intelligence, 2023, 5(3): 220−235 doi: 10.1038/s42256-023-00626-4
[18] Chen Chaochao, Feng Xiaohua, Zhou Jun, et al. Federated large language model: A position paper[J]. arXiv preprint, arXiv: 2307.08925, 2023
[19] Houlsby N, Giurgiu A, Jastrzebski S, et al. Parameter-efficient transfer learning for NLP[C]//Proc of the 36th Int Conf on Machine Learning PMLR. New York: ACM, 2019: 2790−2799
[20] Hu Zhiqiang, Lan Yihuai, Wang Lei, et al. LLM-adapters: An adapter family for parameter-efficient fine-tuning of large language models[J]. arXiv preprint, arXiv: 2304.01933, 2023
[21] Karimi M, Henderson J, Ruder S. Compacter: Efficient low-rank hypercomplex adapter layers[C]//Proc of the 34th Int Conf on Neural Information Processing Systems. New York: ACM, 2021: 1022−1035
[22] Li X, Liang P. Prefix-tuning: Optimizing continuous prompts for generation[J]. arXiv preprint, arXiv: 2101.00190, 2021
[23] Zhang Renrui, Han Jiaming, Zhou Aojun, et al. Llama-adapter: Efficient fine-tuning of language models with zero-init attention[J]. arXiv preprint, arXiv: 2303.16199, 2023
[24] Lester B, Al-Rfou R, Constant N. The power of scale for parameter-efficient prompt tuning[J]. arXiv preprint, arXiv: 2104.08691, 2021
[25] Sun Tianxiang, He Zhengfu, Zhu Qin, et al. Multitask pre-training of modular prompt for chinese few-shot learning[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 11156−11172
[26] Gu Yuxian, Han Xu, Liu Zhiyuan, et al. PPT: Pre-trained prompt tuning for few-shot learning[C]//Proc of the 60th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2022: 8410−8423
[27] Zhang Qingru, Chen Minshuo, Bukharin A, et al. Adaptive budget allocation for parameter-efficient fine-tuning[J]. arXiv preprint, arXiv: 2303.10512, 2023
[28] Chen Yukang, Qian Shengju, Tang Haotian, et al. Longlora: Efficient fine-tuning of long-context large language models[J]. arXiv preprint, arXiv: 2309.12307, 2023
[29] Chua T J, Yu Wenhan, Zhao Jun, et al. FedPEAT: Convergence of federated learning, parameter-efficient fine tuning, and emulator assisted tuning for artificial intelligence foundation models with mobile edge computing[J]. arXiv preprint, arXiv: 2310.17491, 2023
[30] Che Tianshi, Liu Ji, Zhou Yang, et al. Federated learning of large language models with parameter-efficient prompt tuning and adaptive optimization[J]. arXiv preprint, arXiv: 2310.15080, 2023
[31] Babakniya S, Elkordy A R, Ezzeldin Y H, et al. SLoRA: Federated parameter efficient fine-tuning of language models[J]. arXiv preprint, arXiv: 2308.06522, 2023
[32] Zhang Zhuo, Yang Yuanhang, Dai Yong, et al. FedPETuning: When federated learning meets the parameter-efficient tuning methods of pre-trained language models[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 9963−9977
[33] Kuang Weirui, Qian Bingchen, Li Zitao, et al. Federatedscope-llm: A comprehensive package for fine-tuning large language models in federated learning[J]. arXiv preprint, arXiv: 2309.00363, 2023
[34] Fan Tao, Kang Yan, Ma Guoqiang, et al. Fate-llm: A industrial grade federated learning framework for large language models[J]. arXiv preprint, arXiv: 2310.10049, 2023
[35] Chen Haokun, Zhang Yao, Krompass D, et al. FedDAT: An approach for foundation model finetuning in multi-modal heterogeneous federated Learning[J]. arXiv preprint, arXiv: 2308.12305, 2023
[36] Guo Tao, Guo Song, Wang Junxiao, et al. Promptfl: Let federated participants cooperatively learn prompts instead of models-federated learning in age of foundation model[J]. IEEE Transactions on Mobile Computing, 2023, 23(5): 5179−5194
[37] Xu Mengwei, Yin Wangsong, Cai Dongqi, et al. A survey of resource-efficient LLM and multimodal foundation models[J]. arXiv preprint, arXiv: 2401.08092, 2024
[38] Wan Zhongwei, Wang Xin, Liu Che, et al. Efficient large language models: A survey[J]. arXiv preprint, arXiv: 2312.03863, 2023
[39] Miao Xupeng, Oliaro G, Zhang Zhihao, et al. Towards efficient generative large language model serving: A survey from algorithms to systems[J]. arXiv preprint, arXiv: 2312.15234, 2023
[40] Kachris C. A survey on hardware accelerators for large language models[J]. arXiv preprint, arXiv: 2401.09890, 2024
[41] Zhong Juan, Liu Zheng, Chen Xi. Transformer-based models and hardware acceleration analysis in autonomous driving: A survey[J]. arXiv preprint, arXiv: 2304.10891, 2023
[42] Emani M, Foreman S, Sastry V, et al. A comprehensive performance study of large language models on novel AI accelerators[J]. arXiv preprint, arXiv: 2310.04607, 2023
[43] 张晓东,张朝昆,赵继军. 边缘智能研究进展[J]. 计算机研究与发展,2023,60(12):2749−2769 doi: 10.7544/issn1000-1239.202220192 Zhang Xiaodong, Zhang Chaokun, Zhao Jijun. State-of-the-Art survey on edge intelligence[J]. Journal of Computer Research and Development, 2023, 60(12): 2749−2769 (in Chinese) doi: 10.7544/issn1000-1239.202220192
[44] Zhu Xunyu, Li Jian, Liu Yong, et al. A survey on model compression for large language models[J]. arXiv preprint, arXiv: 2308.07633, 2023
[45] Ma Xinyin, Fang Gongfan, Wang Xinchao. LLM-Pruner: On the structural pruning of large language models[J]. arXiv preprint, arXiv: 2305.11627, 2023
[46] Xia Mengzhou, Gao Tianyu, Zeng Zhiyuan, et al. Sheared LLaMA: Accelerating language model pre-training via structured pruning[J]. arXiv preprint, arXiv: 2310.06694, 2023
[47] Wang Hanrui, Zhang Zhekai, Han Song. SpAtten: Efficient sparse attention architecture with cascade token and head pruning[C]//Proc of the 27th IEEE Int Symp on High-Performance Computer Architecture. Piscataway, NJ: IEEE, 2021: 97−110
[48] Zhang Mingyang, Chen Hao, Shen Chunhua, et al. LoRAPrune: Pruning meets low-rank parameter-efficient fine-tuning[J]. arXiv preprint, arXiv: 2305.18403, 2023
[49] Xia Haojun, Zheng Zhen, Li Yuchao, et al. Flash-LLM: Enabling cost-effective and highly-efficient large generative model inference with unstructured sparsity[J]. arXiv preprint, arXiv: 2309.10285, 2023
[50] Frantar E, Alistarh D. SparseGPT: Massive language models can be accurately pruned in one-shot[C]//Proc of the 40th Int Conf on Machine Learning PMLR. New York: ACM, 2023: 10323−10337
[51] Sun Mingjie, Liu Zhuang, Bair A, et al. A simple and effective pruning approach for large language models[J]. arXiv preprint, arXiv: 2306.11695, 2023
[52] Liang Chen, Zuo Simiao, Zhang Qingru, et al. Less is more: Task-aware layer-wise distillation for language model compression[C]//Proc of the 40th Int Conf on Machine Learning PMLR. New York: ACM, 2023: 20852−20867
[53] Zhang Chen, Song Dawei, Ye Zheyu, et al. Towards the law of capacity gap in distilling language models[J]. arXiv preprint, arXiv: 2311.07052, 2023
[54] Padmanabhan S, Onoe Y, Zhang M, et al. Propagating knowledge updates to LMs through distillation[J]. arXiv preprint, arXiv: 2306.09306, 2023
[55] Agarwal R, Vieillard N, Zhou Yongchao, et al. On-policy distillation of language models: Learning from self-generated mistakes[J]. arXiv preprint, arXiv: 2306.13649, 2024
[56] Gu Yuxian, Dong Li, Wei Furu, et al. Knowledge distillation of large language models[J]. arXiv preprint, arXiv: 2306.08543, 2023
[57] Timiryasov I, Tastet J L. Baby llama: Knowledge distillation from an ensemble of teachers trained on a small dataset with no performance penalty[J]. arXiv preprint, arXiv: 2308.02019, 2023
[58] Xiong Yunyang, Varadarajan B, Wu Lemeng, et al. EfficientSAM: Leveraged masked image pretraining for efficient segment anything[J]. arXiv preprint, arXiv: 2312.00863, 2023
[59] Yuan Jianlong, Phan M H, Liu Liyang, et al. FAKD: Feature augmented knowledge distillation for semantic segmentation[C]//Proc of the 2024 IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2024: 595−605
[60] Nasser S A, Gupte N, Sethi A. Reverse knowledge distillation: Training a large model using a small one for retinal image matching on limited data[C]//Proc of the 2024 IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2024: 7778−7787
[61] Zhu Xuekai, Qi Biqing, Zhang Kaiyan, et al. PaD: Program-aided distillation specializes large models in reasoning[J]. arXiv preprint, arXiv: 2305.13888, 2023
[62] Li L H, Hessel J, Yu Youngjae, et al. Symbolic chain-of-thought distillation: Small models can also “think” step-by-step[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 2665−2679
[63] Shridhar K, Stolfo A, Sachan M. Distilling reasoning capabilities into smaller language models[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 7059−7073
[64] Ho N, Schmid L, Yun S Y. Large language models are reasoning teachers[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 14852−14882
[65] Wang Peifeng, Wang Zhengyang, Li Zheng, et al. SCOTT: Self-consistent chain-of-thought distillation[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 5546−5558
[66] Hsieh C Y, Li C L, Yeh C K, et al. Distilling step-by-step! Outperforming larger language models with less training data and smaller model sizes[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 8003−8017
[67] Chen Zeming, Gao Qiyue, Bosselut A, et al. DISCO: Distilling counterfactuals with large language models[C]//Proc of the 61st Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2023: 5514−5528
[68] Jiang Yuxin, Chan C, Chen Mingyang, et al. Lion: Adversarial distillation of proprietary large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL 2023: 3134−3154
[69] Fu Yao, Peng Hao, Ou Litu, et al. Specializing smaller language models towards multi-step reasoning[C]//Proc of the 40th Int Conf on Machine Learning PMLR. New York: ACM, 2023: 10421−10430
[70] Wu Minghao, Waheed A, Zhang Chiyu, et al. LaMini-LM: A diverse herd of distilled models from large-scale instructions[J]. arXiv preprint, arXiv: 2304.14402, 2024
[71] Lin Ji, Tang Jiaming, Tang Haotian, et al. AWQ: Activation-aware weight quantization for LLM compression and acceleration[J]. arXiv preprint, arXiv: 2306.00978, 2023
[72] Li Qingyuan, Zhang Yifan, Li Liang, et al. FPTQ: Fine-grained post-training quantization for large language models[J]. arXiv preprint, arXiv: 2308.15987, 2023
[73] Wei Xiuying, Zhang Yunchen, Zhang Xiangguo, et al. Outlier suppression: Pushing the limit of low-bit transformer language models[C]//Proc of the 36th Int Conf on Neural Information Processing Systems. New York: ACM, 2022: 17402−17414
[74] Wei Xiuying, Zhang Yunchen, Li Yuhang, et al. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 1648−1665
[75] Guo Cong, Tang Jiaming, Hu Weiming, et al. OliVe: Accelerating large language models via hardware-friendly outlier-victim pair quantization[C/OL]//Proc of the 50th Annual Int Symp on Computer Architecture. New York: ACM, 2023[2024-09-10]. https://doi.org/10.1145/3579371.3589038
[76] Yao Zhewei, Yazdani A R, Zhang Minjia, et al. ZeroQuant: Efficient and affordable post-training quantization for large-scale transformers[C]//Proc of the 36th Int Conf on Neural Information Processing Systems. New York: ACM, 2022: 27168−27183
[77] Dettmers T, Lewis M, Belkada Y, et al. LLM. int8(): 8-bit matrix multiplication for transformers at scale[C]//Proc of the 36th Int Conf on Neural Information Processing Systems. New York: ACM, 2022: 30318−30332
[78] Frantar E, Ashkboos S, Hoefler T, et al. GPTQ: Accurate quantization for generative pre-trained transformers[C/OL]//Proc of the 11th Int Conf on Learning Representations. OpenReview. net, 2023[2024-09-10]. https://openreview.net/forum?id=tcbBPnfwxS
[79] Xiao Guangxuan, Lin Ji, Seznec M, et al. SmoothQuant: Accurate and efficient post-training quantization for large language models[C]//Proc of the 40th Int Conf on Machine Learning PMLR. New York: ACM, 2023: 38087−38099
[80] Dettmers T, Svirschevski R, Egiazarian V, et al. SpQR: A sparse-quantized representation for near-lossless LLM weight compression[J]. arXiv preprint, arXiv: 2306.03078, 2023
[81] Lee Changhun, Jin Jungyu, Kim T, et al. OWQ: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models[C]//Proc of the 38th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2024: 13355−13364
[82] Wang Hongyu, Ma Shuming, Dong Li, et al. BitNet: Scaling 1-bit transformers for large language models[J]. arXiv preprint, arXiv: 2310.11453, 2023
[83] Dettmers T, Pagnoni A, Holtzman A, et al. QLoRA: Efficient finetuning of quantized LLMs[C]//Proc of the 37th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2023: 10088−10115
[84] Kim J, Lee J H, Kim S, et al. Memory-efficient fine-tuning of compressed large language models via sub−4-bit integer quantization[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2023: 36187−36207
[85] Liu Zechun, Oguz B, Zhao Changsheng, et al. LLM-QAT: Data-free quantization aware training for large language models[J]. arXiv preprint, arXiv: 2305.17888, 2023
[86] Liu Xinyu, Wang Tao, Yang Jiaming, et al. MPQ-YOLO: Ultra low mixed-precision quantization of YOLO for edge devices deployment[J]. Neurocomputing, 2024, 574: 127210 doi: 10.1016/j.neucom.2023.127210
[87] Kaushal A, Vaidhya T, Rish I. LORD: Low rank decomposition of monolingual code LLMs for one-shot compression[C/OL]//Proc of the 41st ICML 2024 Workshop on Foundation Models in the Wild. OpenReview. net, 2024[2024-09-10]. https://openreview.net/forum?id=br49PQvuMp
[88] Li Yixiao, Yu Yifan, Zhang Qingru, et al. LoSparse: Structured compression of large language models based on low-rank and sparse approximation[C]//Proc of the 40th Int Conf on Machine Learning. New York: PMLR, 2023: 20336−20350
[89] Xu Mingxue, Xu Yaolei, Mandic D P. TensorGPT: Efficient compression of the embedding layer in LLMs based on the tensor-train decomposition[J]. arXiv preprint, arXiv: 2307.00526, 2023
[90] Chang C C, Sung Y Y, Yu Shixing, et al. FLORA: Fine-grained low-rank architecture search for vision transformer[C]//Proc of the 2024 IEEE/CVF Winter Conf on Applications of Computer Vision. Piscataway, NJ: IEEE, 2024: 2482−2491
[91] Benedek N, Wolf L. PRILoRA: Pruned and rank-increasing low-rank adaptation[J]. arXiv preprint, arXiv: 2401.11316, 2024
[92] Cheng Hongrong, Zhang Miao, Shi J Q. A survey on deep neural network pruning-taxonomy, comparison, analysis, and recommendations[J]. arXiv preprint, arXiv: 2308.06767, 2023
[93] Xu Xiaohan, Li Ming, Tao Chongyang, et al. A survey on knowledge distillation of large language models[J]. arXiv preprint, arXiv: 2402.13116, 2024
[94] Zhu Xunyu, Li Jian, Liu Yong, et al. A survey on model compression for large language models[J]. arXiv preprint, arXiv: 2308.07633, 2023
[95] Hu E, Shen Yelong, Wallis P, et al. LoRA: Low-rank adaptation of large language models[C/OL]//Proc of the 10th Int Conf on Learning Representations. OpenReview. net, 2022[2024-09-10]. https://openreview.net/forum?id=nZeVKeeFYf9
[96] Liu Jing, Gong Ruihao, Wei Xiuying, et al. QLLM: Accurate and efficient low-bitwidth quantization for large language models[C/OL]//Proc of the 12th Int Conf on Learning Representations. OpenReview. net, 2024[2024-09-10]. https://openreview.net/forum?id=FIplmUWdm3
[97] Xiao Guangxuan, Tian Yuandong, Chen Beidi, et al. Efficient streaming language models with attention sinks[C/OL]//Proc of the 12th Int Conf on Learning Representations. OpenReview. net, 2024[2024-09-10]. https://openreview.net/forum?id=NG7sS51zVF
[98] Liu Zichang, Desai A, Liao Fangshuo, et al. Scissorhands: Exploiting the persistence of importance hypothesis for LLM KV cache compression at test time[C]//Proc of the 37th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2023: 52342−52364
[99] Zhang Zhenyu, Sheng Ying, Zhou Tianyi, et al. H2O: Heavy-hitter oracle for efficient generative inference of large language models[C]//Proc of the 37th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2023: 34661−34710
[100] Ge Suyu, Zhang Yunan, Liu Liyuan, et al. Model tells you what to discard: Adaptive KV cache compression for LLMs[C/OL]//Proc of the 12th Int Conf on Learning Representations. OpenReview. net, 2024[2024-09-10]. https://openreview.net/forum?id=uNrFpDPMyo
[101] Hooper C, Kim S, Mohammadzadeh H, et al. KVQuant: Towards 10 million context length LLM Inference with KV cache quantization[J]. arXiv preprint, arXiv: 2401.18079, 2024
[102] Kwon W, Li Zhuohan, Zhuang Siyuan, et al. Efficient memory management for large language model serving with pagedattention[C]//Proc of the 29th Symp on Operating Systems Principles. New York: ACM, 2023: 611−626
[103] Del C L, Del G A, Agarwal S, et al. SkipDecode: Autoregressive skip decoding with batching and caching for efficient LLM inference[J]. arXiv preprint, arXiv: 2307.02628, 2023
[104] Zeng Dewen, Du Nan, Wang Tao, et al. Learning to skip for language modeling[J]. arXiv preprint, arXiv: 2311.15436, 2023
[105] Schuster T, Fisch A, Gupta J, et al. Confident adaptive language modeling[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2022: 17456−17472
[106] Sun Tianxiang, Liu Xiangyang, Zhu Wei, et al. A simple hash-based early exiting approach for language understanding and generation[J]. arXiv preprint, arXiv: 2203.01670, 2022
[107] Liao Kaiyuan, Zhang Yi, Ren Xuancheng, et al. A global past-future early exit method for accelerating inference of pre-trained language models[C]//Proc of the 2021 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2021: 2013−2023
[108] Kong Jun, Wang Jin, Yu L C, et al. Accelerating inference for pretrained language models by unified multi-perspective early exiting[C]//Proc of the 29th Int Conf on Computational Linguistics. Stroudsburg, PA: ACL, 2022: 4677−4686
[109] Zeng Ziqian, Hong Yihuai, Dai Hongliang, et al. ConsistentEE: A consistent and hardness-guided early exiting method for accelerating language models inference[C]//Proc of the 38th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2024: 19506−19514
[110] Bae S, Ko J, Song H, et al. Fast and robust early-exiting framework for autoregressive language models with synchronized parallel decoding[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 5910−5924
[111] Valmeekam C S K, Narayanan K K D, Kalathil D, et al. LLMZip: Lossless text compression using large language models[J]. arXiv preprint, arXiv: 2306.04050, 2023
[112] Chevalier A, Wettig A, Ajith A, et al. Adapting language models to compress contexts[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 3829−3846
[113] Li Yucheng, Dong Bo, Guerin F, et al. Compressing context to enhance inference efficiency of large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 6342−6353
[114] Jiang Huiqiang, Wu Qianhui, Lin C Y, et al. LLMLingua: Compressing prompts for accelerated inference of large language models[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 13358−13376
[115] Jiang Huiqiang, Wu Qianhui, Luo Xufang, et al. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression[C]//Proc of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Stroudsburg, PA: ACL, 2024: 1658−1677
[116] Fu Yichao, Bailis P, Stoica I, et al. Break the sequential dependency of LLM inference using lookahead decoding[C]//Proc of the 41st Int Conf on Machine Learning. New York: PMLR, 2024: 14060−14079
[117] Leviathan Y, Kalman M, Matias Y. Fast inference from transformers via speculative decoding[C]//Proc of the 40th int Conf on Machine Learning. New York: PMLR, 2023: 19274−19286
[118] Miao Xupeng, Oliaro G, Zhang Zhihao, et al. SpecInfer: Accelerating generative large language model serving with tree-based speculative inference and verification[C]//Proc of the 29th ACM Int Conf on Architectural Support for Programming Languages and Operating Systems, Volume 3. New York: ACM, 2024: 932–949
[119] Cai T, Li Yuhong, Geng Zhengyang, et al. Medusa: Simple LLM inference acceleration framework with multiple decoding heads[C]//Proc of the 41st int Conf on Machine Learning. New York: PMLR, 2024: 5209−5235
[120] Li Yuhui, Wei Fangyun, Zhang Chao, et al. EAGLE: Speculative sampling requires rethinking feature uncertainty[C]//Proc of the 41st int Conf on Machine Learning. New York: PMLR, 2024: 28935−28948
[121] Xu Daliang, Yin Wangsong, Jin Xin, et al. LLMCad: Fast and scalable on-device large language model inference[J]. arXiv preprint, arXiv: 2309.04255, 2023
[122] Shen Haihao, Chang Hanwen, Dong Bo, et al. Efficient llm inference on cpus[J]. arXiv preprint, arXiv: 2311.00502, 2023
[123] Dao T, Fu Dan, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness[C]//Proc of the 36th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2022: 16344−16359
[124] Dao T. FlashAttention−2: Faster attention with better parallelism and work partitioning[C/OL]//Proc of the 12th Int Conf on Learning Representations. OpenReview. net, 2024[2024-09-10]. https://openreview.net/forum?id=mZn2Xyh9Ec
[125] Dao T, Haziza D, Massa F, et al. Flash-Decoding for long-context inference[EB/OL]. 2023[2024-02-03]. https://pytorch.org/blog/flash-decoding/
[126] Hong Ke, Dai Guohao, Xu Jiaming, et al. FlashDecoding++: Faster large language model inference with asynchronization, flat GEMM optimization, and heuristics[C/OL]//Proc of Machine Learning and Systems. 2024: 148−161[2024-09-12]. https://proceedings.mlsys.org/paper_files/paper/2024/hash/5321b1dabcd2be188d796c21b733e8c7-Abstract-Conference.html
[127] Lai Ruihang, Shao Junru, Feng Siyuan, et al. Relax: Composable abstractions for end-to-end dynamic machine learning[J]. arXiv preprint, arXiv: 2311.02103, 2023
[128] Tillet P, Kung H T, Cox D. Triton: An intermediate language and compiler for tiled neural network computations[C]//Proc of the 3rd ACM SIGPLAN Int Workshop on Machine Learning and Programming Languages. New York: ACM, 2019: 10−19
[129] Feng Siyuan, Hou Bohan, Jin Hongyi, et al. TensorIR: An abstraction for automatic tensorized program optimization[C]//Proc of the 28th ACM Int Conf on Architectural Support for Programming Languages and Operating Systems: Volume 2. New York: ACM, 2023: 804−817
[130] Liu Zichang, Wang Jue, Dao T, et al. Deja Vu: Contextual sparsity for efficient LLMs at inference time[C]//Proc of the 40th Int Conf on Machine Learning. New York: PMLR, 2023: 22137−22176
[131] Sheng Ying, Zheng Lianmin, Yuan Binhang, et al. FlexGen: High-throughput generative inference of large language models with a single GPU[C]//Proc of the 40th Int Conf on Machine Learning. New York: PMLR, 2023: 31094−31116
[132] Song Yixin, Mi Zeyu, Xie Haotong, et al. PowerInfer: Fast large language model serving with a consumer-grade GPU[J]. arXiv preprint, arXiv: 2312.12456, 2023
[133] Yi Rongjie, Guo Liwei, Wei Shiyun, et al. EdgeMoE: Fast on-device inference of MoE-based large language models[J]. arXiv preprint, arXiv: 2308.14352, 2023
[134] Awais M, Naseer M, Khan S, et al. Foundational models defining a new era in vision: A survey and outlook[J]. arXiv preprint, arXiv: 2307.13721, 2023
[135] Tang Shengkun, Wang Yaqing, Kong Zhenglun, et al. You need multiple exiting: Dynamic early exiting for accelerating unified vision language model[C]//Proc of the 44th IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 10781−10791
[136] Li Zi, Tian Lin, Mok C W, et al. Samconvex: Fast discrete optimization for ct registration using self-supervised anatomical embedding and correlation pyramid[G]//Proc of the 26th Medical Image Computing and Computer Assisted Intervention(MICCAI 2023). Berlin: Springer, 2023: 559−569
[137] Zhou Chong, Loy C C, Dai Bo. Extract free dense labels from CLIP[C]//Proc of the 17th Computer Vision(ECCV 2022). Berlin: Springer, 2022: 696−712
[138] Sanghi A, Chu Hang, Lambourn J G, et al. Clip-forge: Towards zero-shot text-to-shape generation[C]//Proc of the 2022 IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2022: 18603−18613
[139] Vaswani A, Shazeer N, Parmar N, et al. Attention is All you Need[C]//Proc of the 31st Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2017: 5999−6009
[140] InternLM. LMDeploy[EB/OL]. 2023[2024-02-04]. https://github.com/InternLM/lmdeploy
[141] Microsoft. DeepSpeed-MII[EB/OL]. 2022[2024-02-04]. https://github.com/microsoft/DeepSpeed-MII
[142] NVIDIA. TensorRT-LLM[EB/OL]. 2023[2024-02-04]. https://github.com/NVIDIA/TensorRT-LLM
[143] vLLM Team. vLLM[EB/OL]. 2023[2024-02-04]. https://github.com/vllm-project/vllm
[144] Lin Ji, Chen Weiming, Lin Yujun, et al. MCUNet: Tiny deep learning on IoT devices[C]//Proc of the 34th Advances in Neural Information Processing Systems. Cambridge, MA: MIT Press, 2020: 11711−11722
[145] Neuralmagic. DeepSparse[EB/OL]. 2021[2024-02-04]. https://github.com/neuralmagic/deepsparse
[146] 李双峰. TensorFlow Lite:端侧机器学习框架[J]. 计算机研究与发展,2020,57(9):1839−1853 doi: 10.7544/issn1000-1239.2020.20200291 Li Shuangfeng. TensorFlow lite: On-device machine learning framework[J]. Journal of Computer Research and Development, 2020, 57(9): 1839−1853 (in Chinese) doi: 10.7544/issn1000-1239.2020.20200291
[147] PyTorch Team. PyTorch ExecuTorch[EB/OL]. 2023[2024-05-28]. https://pytorch.org/executorch
[148] Alibaba. MNN[EB/OL]. 2019[2024-06-30]. https://github.com/alibaba/MNN
[149] Tencent. ncnn[EB/OL]. 2017[2024-05-30]. https://github.com/Tencent/ncnn
[150] MLC Team. MLC LLM[EB/OL]. 2023[2024-02-04]. https://github.com/mlc-ai/mlc-llm
[151] Gerganov G. llama.cpp[EB/OL]. 2023[2024-02-04]. https://github.com/ggerganov/llama.cpp
[152] Karpathy A. llama2.c[EB/OL]. 2023[2024-02-04]. https://github.com/karpathy/llama2.c
[153] Mllm Team. mllm[EB/OL]. 2023[2024-02-04]. https://github.com/UbiquitousLearning/mllm
[154] Intel. Intel Extension for Transformers[EB/OL]. 2022[2024-02-04]. https://github.com/intel/intel-extension-for-transformers
[155] Megvii Inc. InferLLM[EB/OL]. 2023[2024-02-04]. https://github.com/MegEngine/InferLLM
[156] MIT Han Lab. TinyChatEngine[EB/OL]. 2023[2024-02-04]. https://github.com/mit-han-lab/TinyChatEngine
[157] NVIDIA. NanoLLM[EB/OL]. 2024[2024-04-28]. https://github.com/dusty-nv/NanoLLM
[158] Shazeer N. Fast transformer decoding: One write-head is all you need[J]. arXiv preprint, arXiv: 1911.02150, 2019
[159] Ainslie J, Lee-Thorp J, de Jong M, et al. GQA: Training generalized multi-query transformer models from multi-head checkpoints[C]//Proc of the 2023 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2023: 4895−4901
[160] Choromanski K M, Likhosherstov V, Dohan D, et al. Rethinking attention with performers[C/OL]//Proc of the 9th Int Conf on Learning Representations. OpenReview. net, 2021[2024-09-10]. https://openreview.net/forum?id=Ua6zuk0WRH
[161] Shazeer N. Glu variants improve transformer[J]. arXiv preprint, arXiv: 2002.05202, 2020
[162] Lepikhin D, Lee H, Xu Yuanzhong, et al. GShard: Scaling giant models with conditional computation and automatic sharding[C/OL]//Proc of the 9th Int Conf on Learning Representations. OpenReview. net, 2021[2024-09-10]. https://openreview.net/forum?id=qrwe7XHTmYb
[163] Fedus W, Zoph B, Shazeer N. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity[J]. Journal of Machine Learning Research, 2022, 23(1): 120: 5232−120: 5270
[164] Gu A, Dao T. Mamba: Linear-time sequence modeling with selective state spaces[J]. arXiv preprint, arXiv: 2312.00752, 2023
[165] Peng Bo, Alcaide E, Anthony Q, et al. RWKV: Reinventing RNNs for the transformer era[C]//Proc of the Findings of the Association for Computational Linguistics(EMNLP 2023). Stroudsburg, PA: ACL, 2023: 14048−14077
[166] Sun Yutao, Dong Li, Huang Shaohan, et al. Retentive network: A successor to transformer for large language models[J]. arXiv preprint, arXiv: 2307.08621, 2023
[167] 徐志伟,曾琛,朝鲁,等. 面向控域的体系结构:一种智能万物互联的体系结构风格[J]. 计算机研究与发展,2019,56(1):90−102 doi: 10.7544/issn1000-1239.2019.20180775 Xu Zhiwei, Zeng Chen, Zhao Lu, et al. Domain oriented architecture: An architectural style of intelligent interconnection of all things[J]. Journal of Computer Research and Development, 2019, 56(1): 90−102 (in Chinese) doi: 10.7544/issn1000-1239.2019.20180775
[168] 李国杰. 对大数据的再认识[J]. 大数据,2015,1(1):8−16 doi: 10.11959/j.issn.2096-0271.2015.01.001 Li Guojie. Further understanding of big data[J]. Big Data, 2015, 1(1): 8−16 (in Chinese) doi: 10.11959/j.issn.2096-0271.2015.01.001
[169] Woisetschläger H, Isenko A, Wang Shiqiang, et al. Federated fine-tuning of llms on the very edge: The good, the bad, the ugly[C]//Proc of the 8th Workshop on Data Management for End-to-End Machine Learning. New York: ACM, 2024: 39−50
[170] Yang Chengxu, Xu Mengwei, Wang Qipeng, et al. Flash: Heterogeneity-aware federated learning at scale[J]. IEEE Transactions on Mobile Computing, 2024, 23(1): 483−500 doi: 10.1109/TMC.2022.3214234
[171] Lu Wang, Hu Xixu, Wang Jindong, et al. FedCLIP: Fast generalization and personalization for CLIP in federated learning[J]. IEEE Data Engineering Bulletin, 2023, 46(1): 52−66
[172] 矣晓沅,谢幸. 大模型道德价值观对齐问题剖析[J]. 计算机研究与发展,2023,60(9):1926−1945 doi: 10.7544/issn1000-1239.202330553 Yi Xiaoyuan, Xie Xing. An analysis of the alignment of moral values in the large model[J]. Journal of Computer Research and Development, 2023, 60(9): 1926−1945 (in Chinese) doi: 10.7544/issn1000-1239.202330553