
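Machine Learning for Microarchitecture Power Modeling and Design Space Exploration: A Survey

Zhai Jianwang, Ling Zichao, Bai Chen, Zhao Kang, Yu Bei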

Citation: Zhai Jianwang, Ling Zichao, Bai Chen, Zhao Kang, Yu Bei. Machine Learning for Microarchitecture Power Modeling and Design Space Exploration: A Survey[J]. Journal of Computer Research and Development, 2024, 61(6): 1351-1369. DOI: 10.7544/issn1000-1239.202440074. CSTR: 32373.14.issn1000-1239.202440074


    Corresponding author: Zhao Kang (zhaokang@bupt.edu.cn)

  • CLC number: TP332


Funds: This work was supported by the National Key Research and Development Program of China (2022YFB2901100), the Research Grants Council of Hong Kong SAR (CUHK14210723), and the Beijing Natural Science Foundation (4244107).
    Author Bio:

    Zhai Jianwang: born in 1996. PhD, assistant professor. His main research interests include machine learning-assisted electronic design automation (EDA) algorithms, covering microarchitecture power modeling, design space exploration, and physical design.

    Ling Zichao: born in 2000. Bachelor's degree. His main research interests include computer architecture and power modeling.

    Bai Chen: born in 1998. PhD candidate. His main research interests include computer architecture and electronic design automation.

    Zhao Kang: born in 1982. PhD, professor, PhD supervisor. Senior member of CCF. His main research interests include electronic design automation (EDA), compilation optimization for FPGAs, and heterogeneous computing systems.

    Yu Bei: born in 1983. PhD, associate professor, PhD supervisor. His main research interests include electronic design automation (EDA) and machine learning.

  • Abstract:

    Microarchitecture design is a key stage of processor development. It sits upstream in the design flow and directly affects core metrics such as performance, power consumption, and cost. Over the past few decades, new microarchitecture solutions, coupled with advances in semiconductor manufacturing, have enabled each new generation of processors to achieve higher performance with lower power consumption and cost. However, as integrated circuits enter the post-Moore era, the dividends from semiconductor process scaling are increasingly limited, and power consumption has become a major challenge for energy-efficient processor design. Meanwhile, modern processor architectures are growing more complex and their design spaces larger, so designers need fast and accurate trade-offs among design metrics to reach a better microarchitecture. Moreover, the existing stage-by-stage design flow is extremely lengthy and time-consuming, which makes global energy-efficiency optimization difficult. How to perform accurate and efficient early power estimation and design space exploration at the microarchitecture design stage has therefore become a key issue. To tackle these challenges, machine learning has been introduced into the microarchitecture design process, providing efficient and accurate solutions for microarchitecture modeling and optimization. We first introduce the main design flow of processors, microarchitecture design, and its major challenges. We then describe machine learning-assisted integrated circuit design, focusing on research advances in using machine learning for microarchitecture power modeling and design space exploration, and finally conclude with a summary and outlook.

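  • SM4, the Chinese national block cipher standard [1], is widely used for data protection and encrypted communication. Its common modes of operation include ECB (electronic codebook) and CBC (cipher block chaining). In ECB mode, identical plaintext blocks produce identical ciphertext blocks. In CBC mode, each plaintext block is XORed with the previous ciphertext block before encryption, so identical plaintext blocks can yield entirely different ciphertext. CBC therefore offers stronger security and better resistance to attacks than ECB, and is in greater demand. Improving SM4 performance in CBC mode is critical for deploying SM4 on edge devices. However, throughput is hard to raise in CBC mode: the input of each block becomes available only after the previous block has been processed, so pipelining cannot be used to increase throughput.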

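    Reference [2] describes an improvement that precomputes the logic outside the S-box and merges the precomputed results with the S-box into a new lookup table, raising the throughput of SM4 in CBC mode. Building on that work, this paper further optimizes the S-box representation and the round-function iteration. Together these remove two XOR operations from the critical path of the round function and improve performance.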

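    The design targets SM4 in CBC mode and was synthesized for ASIC with Synopsys Design Compiler in the TSMC 40 nm and SMIC 55 nm processes. The synthesis results show a CBC-mode throughput of 4.2 Gb/s and a throughput per unit area of 129.4 Gb·s⁻¹·mm⁻², clearly better than previously published comparable designs. These results indicate that the proposed simplifications have considerable potential for improving SM4 performance.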

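    The paper is organized as follows. It first introduces the SM4 algorithm and its performance bottleneck in CBC mode. It then describes the two proposed simplifications and explains their roles in the round-function iteration and the S-box substitution. Next, it presents the experimental setup together with an analysis and comparison of the results. Finally, it discusses directions for further improvement and application.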

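    SM4 is a symmetric-key cipher widely used for data encryption and protection. It is one of the Chinese national cryptographic standards and offers strong security and good performance.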

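    SM4 follows the block-cipher design: plaintext is divided into 128-bit blocks, and each block is encrypted or decrypted with the key. Figure 1 shows the flow for a single block, which consists of two parts, the key expansion algorithm and the encryption/decryption algorithm. FK in Figure 1 is a preset system parameter that is XORed with the user key to form the input of key expansion. The encryption/decryption algorithm uses the 32 round keys $rk_i$ produced by key expansion to process the block, and the result is output after a reverse-order transform. Encryption and decryption share the same computation; the only difference is that decryption uses the round keys in the reverse order.

    Figure 1. Workflow of the SM4 algorithm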

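    Both key expansion and encryption/decryption consist of 32 round-function iterations and use a four-branch Feistel structure. Each round takes 128 bits as input and produces 128 bits as output; its internal logic is shown in Figure 2. The first 96 bits of the output equal the last 96 bits of the input, and the last 32 bits of the output are produced by the round function.

    Figure 2. Four-branch parallel Feistel structure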

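    The keys used inside key expansion are fixed constants specified by the algorithm, denoted $ck_i$. The keys used in encryption/decryption are the round keys that key expansion derives from the user key, denoted $rk_i$.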

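    Figure 3 shows the structure of SM4 key expansion, which consists of 32 rounds of the key-expansion round function. The key is 128 bits, and FK is a 128-bit constant defined in the SM4 standard. Their XOR is the input of the first key-expansion round; a multiplexer then feeds the result back for iteration, and 32 iterations in total produce the 32 round keys.

    Figure 3. Key expansion algorithm structure of SM4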

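    Let MK be the user key. The corresponding 32 round keys are given by Eq. (1):

    $$\begin{cases}(k_0,k_1,k_2,k_3)=MK\oplus FK,\\ k_{i+4}=k_i\oplus F(k_{i+1}\oplus k_{i+2}\oplus k_{i+3}\oplus ck_i),\\ rk_i=k_{i+4},\end{cases}\tag{1}$$

    where $ck_i$ is a preset 32-bit system parameter, $rk_i$ is the round key of round $i$, and $F$ is the key-expansion round function. $F$ is composed of the S-box substitution $\tau:\mathbb{Z}_2^{32}\to\mathbb{Z}_2^{32}$ and the linear transform $L(x)=x\oplus(x\lll 13)\oplus(x\lll 23)$, where $\lll$ denotes cyclic left shift.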

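    The overall structure of SM4 encryption/decryption is similar to that of key expansion: both contain 32 round-function iterations. The difference is that encryption/decryption additionally includes one reverse-order transform.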

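    Figure 4 shows the round-function iteration of SM4. $X_1$ to $X_4$ are the inputs of round 1; $X_2$ to $X_5$ are the outputs of round 1 and also the inputs of round 2. $rk_1$ is the round key of round 1, and T is the round function of the encryption/decryption module. Like the key-expansion round function $F$, T is composed of the S-box substitution $\tau$ and a linear transform $L(x)=x\oplus(x\lll 2)\oplus(x\lll 10)\oplus(x\lll 18)\oplus(x\lll 24)$.

    Figure 4. Round function structure of the SM4 encryption and decryption module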

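    Through these repeated iterations, SM4 achieves strong encryption and decryption. In CBC mode, however, the dependency between adjacent blocks prevents conventional pipelining from raising throughput. This paper therefore proposes two simplifications that reduce the operations on the critical path and improve the performance of SM4 in CBC mode.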

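    Figure 4 shows the structure of the round function in the encryption/decryption module. Ignoring the delay of the T function, the critical path of one round-function iteration contains three XOR operations. Written as a formula, the iteration of the SM4 encryption/decryption round function is Eq. (2):

    $$X_{i+4}=X_i\oplus T(X_{i+1}\oplus X_{i+2}\oplus X_{i+3}\oplus rk_i).\tag{2}$$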

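    For two adjacent round-function iterations this gives:

    $$\begin{cases}X_{i+4}=X_i\oplus T(X_{i+1}\oplus X_{i+2}\oplus X_{i+3}\oplus rk_i),\\ X_{i+5}=X_{i+1}\oplus T(X_{i+2}\oplus X_{i+3}\oplus X_{i+4}\oplus rk_{i+1}).\end{cases}\tag{3}$$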

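    Eqs. (1) to (3) show that, because SM4 uses a Feistel structure with four data branches, 96 bits of the input are identical in any two adjacent round-function iterations. In Eq. (3), the two adjacent rounds compute $X_{i+2}\oplus X_{i+3}$ twice.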

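    A simple optimization follows: when passing data between rounds, additionally pass the value of $X_{i+2}\oplus X_{i+3}\oplus rk_{i+1}$ and use it in the next round. The resulting flow is shown in Figure 5.

    Figure 5. Optimized round function structure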

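    Compared with the flow in Figure 4, the optimized round function fetches the next round's key while computing the current round's output and precomputes on the data shared by the two rounds. Over the whole encryption or decryption flow, this saves the delay of 32 XOR operations in total, one per round.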
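    The forwarding idea can be checked in software. The following is a minimal sketch, not the paper's hardware design: `t_stub` is a hypothetical stand-in for T (the real T needs the SM4 S-box and L), and all function names are illustrative. The equivalence relies only on XOR being associative and commutative, so any deterministic 32-bit function is enough for the check.

```python
MASK32 = 0xFFFFFFFF


def t_stub(x):
    """Hypothetical stand-in for SM4's T function (NOT the real S-box + L)."""
    x = (x * 0x9E3779B1) & MASK32
    return x ^ (x >> 15)


def rounds_baseline(state, round_keys):
    """Eq. (2): X[i+4] = X[i] ^ T(X[i+1] ^ X[i+2] ^ X[i+3] ^ rk[i])."""
    x = list(state)
    for rk in round_keys:
        x.append(x[-4] ^ t_stub(x[-3] ^ x[-2] ^ x[-1] ^ rk))
    return x[-4:]


def rounds_forwarded(state, round_keys):
    """Same recurrence, with X[i+1] ^ X[i+2] ^ rk[i] prepared one round early."""
    x = list(state)
    partial = x[1] ^ x[2] ^ round_keys[0]  # prepared before round 0
    for i in range(len(round_keys)):
        # Only one XOR (with the newest word) remains in front of T.
        x.append(x[-4] ^ t_stub(partial ^ x[-1]))
        if i + 1 < len(round_keys):
            # Uses X[i+2], X[i+3], rk[i+1]; none depends on the new word.
            partial = x[-3] ^ x[-2] ^ round_keys[i + 1]
    return x[-4:]


state = [0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210]
keys = [(0x11111111 * (i + 1)) & MASK32 for i in range(32)]
print(rounds_baseline(state, keys) == rounds_forwarded(state, keys))
```

    In hardware, the update of `partial` does not depend on the output of T, so it can be evaluated in parallel with T. That is why the forwarded XORs drop off the critical path. The final reverse-order transform is omitted here because it does not affect the comparison.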

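    The S-box is a basic building block in cryptography. It implements a nonlinear transformation of data and is used in DES, AES, SM1, SM4, and other algorithms. In SM4, it provides an 8-bit to 8-bit nonlinear transformation.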

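    In SM4, the S-box is usually combined with a linear transform L to form the T function shown in Figures 4 and 5. T lies on the critical path of the encryption/decryption round function, so reducing the critical-path delay of T also reduces the delay of the whole encryption/decryption module and improves efficiency. Figure 6 shows the internal structure of T, where <<< denotes a cyclic left shift of 32-bit data. The critical path contains one S-box and three XOR operations. In hardware, cyclic shifts are implemented by wiring and add no path delay.

    Figure 6. T function structure of the SM4 encryption and decryption module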

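    The T function contains four XOR operations, which in a circuit means at least three XOR operations on its critical path. One optimization is therefore to change the S-box to 8-bit input and 32-bit output [2-3] and to apply L in advance to each of the four S-boxes, as shown in Figure 7. In Figure 7, the results are stored in encoded form: each SBox in Figure 6 is merged with the subsequent linear transform L into an exSBox. Only the outputs of the four exSBoxes then need to be XORed, which removes one XOR operation from the critical path.

    Figure 7. Optimized T function structure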

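    Although the modified S-box outputs more data than the original one, in hardware the output is still produced by table lookup through the same number of multiplexers. The path delay and the security of the S-box are therefore unchanged.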

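    Take exSBox1 in Figure 7 as an example, with 0xff as the input. Applying the S-box to 0xff gives 0x48. Because the input of exSBox1 corresponds to the most significant byte, the value is extended to the 32-bit word 0x48000000. Applying L to it gives 0x68492121. Table 1 shows the computation. In its first five rows, the eight input bits and their positions after each cyclic shift are highlighted in bold in the source table; all other positions are 0 for any input.

    Table 1. Construction of the exSBox1 output for input 0xff (S-box output 0x48)

    | Operand | Bits 0-7 | Bits 8-15 | Bits 16-23 | Bits 24-31 |
    |---|---|---|---|---|
    | Original data | 01001000 | 00000000 | 00000000 | 00000000 |
    | <<<2 | 00100000 | 00000000 | 00000000 | 00000001 |
    | <<<10 | 00000000 | 00000000 | 00000001 | 00100000 |
    | <<<18 | 00000000 | 00000001 | 00100000 | 00000000 |
    | <<<24 | 00000000 | 01001000 | 00000000 | 00000000 |
    | XOR sum | 01101000 | 01001001 | 00100001 | 00100001 |

    Note: bold digits in the source table mark the input data and its position after each cyclic shift.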

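    Table 1 shows that, in the last row, only bits 0 to 5 and bits 14 and 15 are produced by XOR; the remaining 24 bits are rearrangements of the eight input bits. In hardware, the lookup can therefore be implemented with an S-box of 8-bit input and 16-bit output. For the other three exSBoxes in Figure 7, the output for the same input is obtained by cyclically shifting the data in Table 1. This holds for all four S-boxes regardless of their position.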

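    Specifically, let $p$ be the 8-bit input and $\tau(p)$ the output of the S-box in standard SM4, where $\tau$ is the S-box substitution defined in the SM4 standard, mapping an 8-bit input to an 8-bit output. Let $X=(x_0,x_1,\ldots,x_{15})$ be the 16-bit data stored in exSBox1, and $Y=(y_0,y_1,\ldots,y_{31})$ the 32-bit output required by the optimized T function. Then $X$ is given by Eq. (4), where the cyclic shift acts on the 8-bit value:

    $$\begin{cases}(x_0,x_1,\ldots,x_7)=\tau(p),\\ (x_8,x_9,\ldots,x_{15})=\tau(p)\oplus(\tau(p)\lll 2).\end{cases}\tag{4}$$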

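    As Table 1 shows, $Y$ can be obtained from $X$ purely by rearranging bits. The values for exSBox2, exSBox3, and exSBox4 are obtained by cyclically shifting $Y$. Because this process involves only assignments, it can be implemented in a circuit by physical wiring. Compared with the design in [2], this saves one third of the area. The mapping is given in Eq. (5).

    $$\begin{cases}(y_0,y_1,\ldots,y_5)=(x_8,x_9,\ldots,x_{13}),\\ (y_6,y_7)=(x_6,x_7),\\ (y_8,y_9,\ldots,y_{13})=(x_0,x_1,\ldots,x_5),\\ (y_{14},y_{15})=(x_{14},x_{15}),\\ (y_{16},y_{17},\ldots,y_{21})=(x_2,x_3,\ldots,x_7),\\ (y_{22},y_{23})=(x_0,x_1),\\ (y_{24},y_{25},\ldots,y_{29})=(x_2,x_3,\ldots,x_7),\\ (y_{30},y_{31})=(x_0,x_1).\end{cases}\tag{5}$$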
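    Eqs. (4) and (5) can be cross-checked against the direct definition of L. The sketch below is illustrative only and assumes bit 0 is the most significant bit, as in Table 1; the function names are hypothetical. It takes the S-box output byte s = τ(p) as its input, so the SM4 S-box table itself is not needed.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Cyclic left shift of a 32-bit word."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def rotl8(b, n):
    """Cyclic left shift of an 8-bit value."""
    return ((b << n) | (b >> (8 - n))) & 0xFF


def linear_l(x):
    """L(x) = x ^ (x<<<2) ^ (x<<<10) ^ (x<<<18) ^ (x<<<24)."""
    return x ^ rotl32(x, 2) ^ rotl32(x, 10) ^ rotl32(x, 18) ^ rotl32(x, 24)


def compact_entry(s):
    """Eq. (4): 16-bit X = s || (s ^ (s <<< 2)), with s the S-box output byte."""
    return (s << 8) | (s ^ rotl8(s, 2))


def expand_entry(x16):
    """Eq. (5): rebuild the 32-bit Y from X using bit selection only."""
    x = [(x16 >> (15 - i)) & 1 for i in range(16)]  # x[0] is the MSB
    y = (x[8:14] + x[6:8] + x[0:6] + x[14:16]
         + x[2:8] + x[0:2] + x[2:8] + x[0:2])
    value = 0
    for bit in y:
        value = (value << 1) | bit
    return value


# Worked example from Table 1: S-box output 0x48 in the top byte.
print(hex(expand_entry(compact_entry(0x48))))  # 0x68492121
print(all(expand_entry(compact_entry(s)) == linear_l(s << 24)
          for s in range(256)))
```

    The exhaustive loop over all 256 byte values confirms that the 16-bit entry plus wiring reproduces the 32-bit exSBox1 output for every possible S-box output, not only for the 0x48 example.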

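    Field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs) are the two mainstream ways to implement cryptographic algorithms in hardware. FPGAs offer programmability, flexibility, and fast design turnaround, but ASICs deliver higher performance, which matches the efficiency goal of this design. The design is therefore implemented as an ASIC.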

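    Figure 8 shows the overall structure of the SM4 hardware system, which comprises the key expansion module, the encryption/decryption module, and the combinational logic that adapts to CBC mode. For a single encryption or decryption task whose plaintext is divided into n blocks, key expansion runs once and encryption/decryption runs n times. Optimizing the encryption/decryption algorithm is therefore the focus of optimizing the SM4 hardware design. For each plaintext block, the two proposed simplifications remove the delay of 64 XOR gate levels, which substantially improves efficiency.

    Figure 8. Overall architecture of the SM4 hardware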

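    There are two main hardware implementation schemes for SM4. The first is a pipelined structure, in which registers connect several encryption/decryption modules that work simultaneously, as shown in Figure 9(a). The second is loop iteration, in which n of the 32 round functions are combined into one combinational circuit, called an n-in-1 circuit, as shown in Figure 9(b). The advantage of the pipelined structure is that it fully uses the n encryption cores and accelerates computation without lowering the overall clock frequency. For SM4, stacking pipeline stages within a reasonable range can achieve very high throughput.

    Figure 9. Pipeline architecture and loop iteration architecture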

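    However, the pipelined structure suits only modes such as ECB in which blocks have no data dependency. In CBC mode, the output of the previous block must be XORed with the input of the current block, so adjacent blocks depend on each other and pipelining cannot accelerate the computation. This design therefore does not use a pipelined structure.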
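    The dependency can be seen in a few lines of code. This is a minimal sketch under loud assumptions: `toy_block_encrypt` is a hypothetical placeholder built on SHA-256, not SM4 and not even invertible, used only so the chaining structure can run without an SM4 implementation.

```python
import hashlib

BLOCK_BYTES = 16  # SM4 processes 128-bit blocks


def toy_block_encrypt(block: bytes, key: bytes) -> bytes:
    """Hypothetical stand-in for one block encryption (NOT SM4)."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def ecb_encrypt(blocks, key):
    # Every block is independent, so the calls could overlap in a pipeline.
    return [toy_block_encrypt(b, key) for b in blocks]


def cbc_encrypt(blocks, key, iv):
    # Block i needs ciphertext i-1 before it can start: a serial chain.
    out, prev = [], iv
    for b in blocks:
        prev = toy_block_encrypt(xor_bytes(b, prev), key)
        out.append(prev)
    return out


key = b"k" * 16
iv = bytes(BLOCK_BYTES)
blocks = [b"A" * BLOCK_BYTES] * 2  # two identical plaintext blocks

ecb = ecb_encrypt(blocks, key)
cbc = cbc_encrypt(blocks, key, iv)
print(ecb[0] == ecb[1])  # identical ciphertext blocks in ECB
print(cbc[0] == cbc[1])  # chained blocks differ in CBC
```

    The loop-carried variable `prev` is the reason a pipeline cannot be kept full in CBC encryption: the second call cannot begin until the first has finished.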

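    Although the loop-iteration structure lowers the clock frequency of the module and offers only limited throughput gains, it is compatible with both ECB and CBC modes. This design therefore adopts loop iteration.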

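    In SM4, key expansion resembles encryption/decryption in that both contain 32 rounds of iteration. The key expansion module implements the 32 rounds by looping the single-round combinational logic circuit shown in Figure 2 32 times.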

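    At the output of the key expansion module, registers store the round key of each round, indexed 0 to 31, as shown in Figure 10. Starting the index at 0 has an advantage: decryption uses the keys in reverse order, so round $k$ of encryption uses key number $k-1$ and round $k$ of decryption uses key number $32-k$. In binary, the two indices are converted into each other by bitwise inversion.

    Figure 10. Storage and usage of round keys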
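    The index relationship is easy to verify. The following sketch assumes a 5-bit index register, which is the natural width for indices 0 to 31 but is not stated explicitly in the text; the function names are illustrative.

```python
ROUNDS = 32
INDEX_MASK = 0b11111  # assumed 5-bit index register (0..31)


def enc_key_index(k):
    """Round k of encryption (k = 1..32) uses round key k - 1."""
    return k - 1


def dec_key_index(k):
    """Round k of decryption (k = 1..32) uses round key 32 - k."""
    return ROUNDS - k


def invert_index(i):
    """Bitwise inversion of a 5-bit index."""
    return i ^ INDEX_MASK


for k in (1, 2, 32):
    print(k, enc_key_index(k), dec_key_index(k), invert_index(enc_key_index(k)))
```

    Since inverting a 5-bit value i yields 31 - i, inverting k - 1 gives 32 - k, so a single row of inverters selects between the encryption and decryption key order.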

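    To guarantee correct results, the key expansion module also sends a control signal to the encryption/decryption module indicating its working state, so that encryption or decryption never uses stale round keys before the round keys have been fully updated.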

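    The national standard [1] gives no test vectors specific to CBC mode. The design is therefore implemented entirely in Verilog HDL and is synthesized, simulated, and verified on an FPGA platform to confirm functional correctness and to analyze performance, as shown in Figure 11. Specifically, a host computer sends random plaintext to the FPGA board over PCIe, the board encrypts it and returns the result, and the result is compared with a software implementation. If the two outputs match exactly over many repeated runs, the SM4 circuit is considered functionally correct.

    Figure 11. Testing procedure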

    The design was verified on a Zynq 7020 FPGA development board, confirming functional correctness. The maximum clock frequency is 95 MHz and the throughput is about 1.5 Gb/s.
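    The reported FPGA figures are consistent with a simple cycle-count model. The sketch below is a rough estimate under stated assumptions: it uses the 4-in-1 configuration mentioned in the paper's conclusion, so one 128-bit block takes 32 / 4 = 8 cycles, and it ignores any extra cycles for loading, the CBC XOR, or control. The function name and parameters are illustrative.

```python
def iterative_throughput_gbps(freq_mhz, rounds_per_cycle,
                              block_bits=128, total_rounds=32):
    """Throughput of an n-in-1 iterative core, ignoring overhead cycles."""
    cycles_per_block = total_rounds / rounds_per_cycle
    bits_per_second = block_bits * freq_mhz * 1e6 / cycles_per_block
    return bits_per_second / 1e9


# 95 MHz on the Zynq 7020 with an assumed 4-in-1 round circuit.
print(round(iterative_throughput_gbps(95, 4), 2))  # 1.52
```

    Under these assumptions the model gives 1.52 Gb/s, in line with the reported value of about 1.5 Gb/s.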

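    For ASIC, the design was evaluated in two processes, SMIC 55 nm and TSMC 40 nm, and synthesized under timing constraints with Synopsys Design Compiler. Chip area and the number of logic gates (gates) serve as the evaluation metrics; the results are listed in Tables 2 and 3. In CBC mode, the design reaches a throughput per unit area of 129.4 Gb·s⁻¹·mm⁻² at a power of 3.97 mW, clearly better than comparable designs. Measured by gate count, the design also clearly outperforms comparable designs, with a throughput per gate of 0.205×10⁻³ Gb·s⁻¹·gates⁻¹.

    Table 2. Comparison of SM4 Synthesis Results and Area Efficiency

    | Process node | Chip area (mm²) | Throughput (Gb·s⁻¹) | Throughput per unit area (Gb·s⁻¹·mm⁻²) | Power (mW) |
    |---|---|---|---|---|
    | 40 nm* | 0.0335 | 4.34 | 129.40 | 3.97 |
    | 55 nm* | 0.0877 | 4.41 | 50.30 | 10.88 |
    | 65 nm [2] | 0.1260 | 5.24 | 41.59 | - |
    | 180 nm [4] | 0.0790 | 0.10 | 1.27 | 5.31 |
    | 55 nm [5] | 0.0870 | 0.40 | 4.59 | 4.35 |
    | 350 nm [6] | 0.0270 | 0.412 | 15.26 | - |

    Note: * marks the results of this paper; - marks values not given in the source table.

    Table 3. Comparison of SM4 Synthesis Results and Gate Efficiency

    | Process node | Gates | Throughput (Gb·s⁻¹) | Throughput per gate (Gb·s⁻¹·gates⁻¹) |
    |---|---|---|---|
    | 40 nm* | 21.2×10³ | 4.34 | 0.205×10⁻³ |
    | 55 nm* | 21.1×10³ | 4.41 | 0.209×10⁻³ |
    | 180 nm [6] | 32.0×10³ | 0.80 | 0.025×10⁻³ |
    | 65 nm [7] | 31.0×10³ | 1.23 | 0.040×10⁻³ |
    | 55 nm [8] | 22.0×10³ | 0.27 | 0.012×10⁻³ |
    | 130 nm [9] | 22.0×10³ | 0.80 | 0.036×10⁻³ |

    Note: * marks the results of this paper.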
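    The efficiency columns in Tables 2 and 3 are simple ratios of the other columns. As a rough check, the sketch below recomputes them for the two rows of this paper; the dictionary layout and function name are illustrative, and small differences from the tabulated values are expected because the inputs are themselves rounded.

```python
# (area in mm^2, gate count, throughput in Gb/s) for the rows marked * above.
THIS_WORK = {
    "40 nm": (0.0335, 21.2e3, 4.34),
    "55 nm": (0.0877, 21.1e3, 4.41),
}


def efficiency(area_mm2, gates, throughput_gbps):
    """Return (throughput per mm^2, throughput per gate)."""
    return throughput_gbps / area_mm2, throughput_gbps / gates


for node, row in THIS_WORK.items():
    per_area, per_gate = efficiency(*row)
    print(f"{node}: {per_area:.2f} Gb/s/mm^2, {per_gate * 1e3:.3f}e-3 Gb/s/gate")
```

    For both nodes the recomputed values agree with the tables to within about 0.2%, which is within what rounding of the published area and throughput would explain.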

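    Synthesizing the design under different processes and voltages gives its throughput in different usage scenarios. The design was synthesized in TSMC 40 nm, SMIC 55 nm, and SMIC 130 nm under several process corners; the results are listed in Table 4.

    Table 4. Comparison of SM4 Synthesis Results and Efficiency with Different Process Corners

    | Process node | Process corner | Area (gates) | Throughput (Gb·s⁻¹) | Power (mW) |
    |---|---|---|---|---|
    | 40 nm | 0.99 V / 125 °C / SS | 21.0×10³ | 2.40 | 2.55 |
    | 40 nm | 1.1 V / 25 °C / TT | 21.2×10³ | 4.34 | 3.97 |
    | 40 nm | 1.21 V / 0 °C / FF | 20.9×10³ | 6.96 | 8.35 |
    | 55 nm | 1 V / 25 °C / TT | 20.0×10³ | 2.78 | 4.10 |
    | 55 nm | 1.2 V / 25 °C / TT | 21.1×10³ | 4.41 | 10.88 |
    | 55 nm | 1.32 V / 0 °C / FF | 17.8×10³ | 6.84 | 33.59 |
    | 130 nm | 1.08 V / 125 °C / SS | 20.8×10³ | 1.11 | 6.86 |
    | 130 nm | 1.2 V / 25 °C / TT | 21.0×10³ | 1.75 | 15.70 |
    | 130 nm | 1.32 V / 0 °C / FF | 21.8×10³ | 2.45 | 23.03 |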

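    Using the two proposed methods for simplifying the critical path of the SM4 encryption/decryption module and reducing area, a 4-in-1 SM4 circuit was implemented and functionally verified on a Zynq 7020 development board. The ASIC synthesis results show that this SM4 circuit achieves higher throughput per unit area and lower power than other schemes. The optimization is therefore effective for SM4 and offers a reference for improving the CBC-mode throughput per unit area of other block ciphers.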

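    Author contributions: Hao Zeyu (郝泽钰) proposed the research plan and wrote the paper; Dai Tian'ao (代天傲), Huang Yicheng (黄亦成), and Duan Cenlin (段岑林) assisted with the verification experiments on the ASIC platform; Dong Jin (董进), Wu Shiyong (吴世勇), Zhang Bo (张博), Wang Xueyan (王雪岩), and Jia Xiaotao (贾小涛) provided guidance and revised the paper; Yang Jianlei (杨建磊) provided guidance and discussed and finalized the manuscript.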

  • Figure 1. Illustration of processor chip design flow

    Figure 2. Illustration of RISC-V BOOM microarchitecture

    Figure 3. Statistics on the number and percentage of EDA publications based on machine learning[21]

    Figure 4. Comparison of processor power modeling methods

    Figure 5. Converting runtime models to design-time models[57]

    Figure 6. Power modeling flow proposed by Kumar et al.[60]

    Figure 7. Illustration of PowerTrain[35]

    Figure 8. Flowchart of McPAT-Calib framework[64]

    Figure 9. Illustration of PANDA framework[65]

    Figure 10. Transfer learning-based microarchitecture power modeling flow proposed by Zhai et al.[66]

    Figure 11. Illustration of NoCeption framework[68]

    Figure 12. Common power modeling flow

    Figure 13. Illustration of ArchRanker framework[75]

    Figure 14. Design space exploration methodology combining statistical sampling and AdaBoost learning[77]

    Figure 15. Illustration of BOOM-Explorer framework[79]

    Figure 16. RL-based microarchitecture design space exploration[80]

    Figure 17. Illustration of IT-DSE framework[83]

    Figure 18. Training methods of microarchitecture DAG and VGAE[84]

    Figure 19. The general flow of design space exploration

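    Table 1. A Summary of Machine Learning-Assisted Methods for Microarchitecture Power Modeling

    | Model/Reference | Stage | Scope | Modeling features | Modeling method | Model error |
    |---|---|---|---|---|---|
    | PowerTrain[35] | Runtime | Different microarchitectures, different workloads | PMC + hardware description | Linear regression | About 2% |
    | WattWatcher[56] | Runtime | Different microarchitectures, different workloads | PMC + hardware description | Linear regression | 2.67% on average |
    | [53] | Runtime | Single microarchitecture, different workloads | PMC | Linear regression | <9% |
    | [54] | Runtime | Single microarchitecture, different workloads | PMC | Linear regression | 2.8%~3.8% |
    | [55] | Runtime | Single microarchitecture, different workloads | PMC | Nonlinear regression | 6.8% on average |
    | [51] | Design time | Different microarchitectures | Microarchitecture design parameters | Nonlinear regression | 5.4% median |
    | [52] | Design time | Single microarchitecture, different workloads | Simulation activity statistics | Linear regression | About 2.5% |
    | [57] | Design time | Single microarchitecture, different workloads | Simulation activity statistics | Linear regression | 5.9% on average |
    | [58-59] | Design time | Different microarchitectures | Microarchitecture design parameters | Neural network | <2% |
    | [60] | Design time | Single microarchitecture, different workloads | External input signals | Machine learning model | About 3.6% |
    | McPAT-Calib[63-64] | Design time | Different microarchitectures, different workloads | Architecture parameters, activity statistics | Analytical model + machine learning | 3%~6% |
    | PANDA[65] | Design time | Different microarchitectures, different workloads | Architecture parameters, activity statistics | Analytical function + machine learning | 2%~8% |
    | [66] | Design time | Different microarchitectures, different workloads | Architecture parameters, activity statistics | Neural network + transfer learning | 4.4% on average |
    | TrEnDSE[67] | Design time | Different microarchitectures, cross-workload | Architecture parameters | Ensemble model + transfer learning | <1% |
    | NoCeption[68] | Design time | Different NoC configurations and topologies | Architecture parameters | Graph neural network | About 2.5% |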

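    Table 2. Microarchitecture Design Space of BOOM

    | Module | Component parameter | Description | Candidate values |
    |---|---|---|---|
    | Front end | FetchWidth | Number of instructions fetched at once | 4, 8 |
    | Front end | FetchBufferEntry | Number of fetch buffer entries | 8, 16, 24, 32, 35, 40 |
    | Front end | RasEntry | Number of return address stack entries | 16, 24, 32 |
    | Front end | BranchCount | Number of simultaneously speculated branches | 8, 12, 16, 20 |
    | Front end | ICacheWay | ICache set associativity | 2, 4, 8 |
    | Front end | ICacheTLB | ICache translation lookaside buffer ways | 8, 16, 32 |
    | Front end | ICacheFetchBytes | ICache line capacity | 2, 4 |
    | Instruction decode unit | DecodeWidth | Maximum number of instructions decoded at once | 1, 2, 3, 4, 5 |
    | Instruction decode unit | RobEntry | Number of reorder buffer entries | 32, 64, 96, 128, 130 |
    | Instruction decode unit | IntPhyRegister | Number of integer physical registers | 48, 64, 80, 96, 112 |
    | Instruction decode unit | FpPhyRegister | Number of floating-point physical registers | 48, 64, 80, 96, 112 |
    | Execution unit | MemIssueWidth | Issue width of memory instructions | 1, 2 |
    | Execution unit | IntIssueWidth | Issue width of integer instructions | 1, 2, 3, 4, 5 |
    | Execution unit | FpIssueWidth | Issue width of floating-point instructions | 1, 2 |
    | Load-store unit | LDQEntry | Number of load queue entries | 8, 16, 24, 32 |
    | Load-store unit | STQEntry | Number of store queue entries | 8, 16, 24, 32 |
    | Load-store unit | DCacheWay | DCache set associativity | 2, 4, 8 |
    | Load-store unit | DCacheMSHR | Number of miss status handling registers | 2, 4, 8 |
    | Load-store unit | DCacheTLB | DCache translation lookaside buffer ways | 8, 16, 32 |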

    Table 3. Design Parameter Choices of Different Microarchitectures for BOOM Processors

    | Method | Microarchitecture component configuration parameters |
    |---|---|
    | Original two-wide BOOM[14-15] | {4, 16, 32, 12, 4, 8, 2, 2, 64, 80, 64, 1, 2, 1, 16, 16, 4, 2, 8} |
    | BOOM-Explorer[78] | {4, 16, 16, 8, 2, 8, 2, 2, 32, 64, 64, 1, 3, 1, 24, 24, 8, 4, 8} |
    | BOOM-Explorer[79] | {4, 16, 16, 8, 4, 8, 2, 2, 32, 64, 64, 1, 3, 1, 24, 24, 8, 4, 8} |
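    To give a feel for the scale of the space in Table 2, the following sketch counts the raw Cartesian product of the candidate values and checks that the baseline vector from Table 3 lies inside it. It assumes the 19 entries of each configuration vector follow the row order of Table 2, which the source does not state explicitly, and it counts every combination without the legality constraints a real exploration flow would apply.

```python
from math import prod

# Candidate values copied from Table 2, in row order.
DESIGN_SPACE = {
    "FetchWidth": [4, 8],
    "FetchBufferEntry": [8, 16, 24, 32, 35, 40],
    "RasEntry": [16, 24, 32],
    "BranchCount": [8, 12, 16, 20],
    "ICacheWay": [2, 4, 8],
    "ICacheTLB": [8, 16, 32],
    "ICacheFetchBytes": [2, 4],
    "DecodeWidth": [1, 2, 3, 4, 5],
    "RobEntry": [32, 64, 96, 128, 130],
    "IntPhyRegister": [48, 64, 80, 96, 112],
    "FpPhyRegister": [48, 64, 80, 96, 112],
    "MemIssueWidth": [1, 2],
    "IntIssueWidth": [1, 2, 3, 4, 5],
    "FpIssueWidth": [1, 2],
    "LDQEntry": [8, 16, 24, 32],
    "STQEntry": [8, 16, 24, 32],
    "DCacheWay": [2, 4, 8],
    "DCacheMSHR": [2, 4, 8],
    "DCacheTLB": [8, 16, 32],
}

# Original two-wide BOOM configuration from Table 3 (assumed same ordering).
BASELINE = [4, 16, 32, 12, 4, 8, 2, 2, 64, 80, 64, 1, 2, 1, 16, 16, 4, 2, 8]


def raw_size(space):
    """Number of combinations before any legality constraints."""
    return prod(len(values) for values in space.values())


def is_member(config, space):
    """True if every entry is one of the candidate values for its parameter."""
    return len(config) == len(space) and all(
        value in candidates
        for value, candidates in zip(config, space.values())
    )


print(f"{raw_size(DESIGN_SPACE):.2e} raw combinations")
print(is_member(BASELINE, DESIGN_SPACE))
```

    Even as an upper bound, a count on the order of 10^10 shows why exhaustive evaluation with slow simulation or EDA flows is impractical and why sample-efficient exploration is needed.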

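    Table 4. A Summary of Machine Learning-Based Methods for Microarchitecture Design Space Exploration

    | Method/Reference | Exploration target | Exploration method | Source of PPA and other data |
    |---|---|---|---|
    | [51] | Pareto frontier, pipeline depth, and heterogeneity analysis | Design space sampling, statistical learning | Turandot simulator, PowerTimer tool |
    | [58-59] | Predictive model of the design space | ANN modeling, traversal of subspaces | SESC simulator, CACTI, etc. |
    | [73] | Predictive model of the design space | ANN and linear regression modeling, traversal of subspaces | Simulator, Wattch, CACTI, etc. |
    | ArchRanker[75] | Optimal design for a specific target | RankBoost ranking model, traversal of subspaces | Simulator, Wattch, CACTI, etc. |
    | [77] | Predictive model and optimal design | AdaBoost.RT model and orthogonal array sampling | gem5 simulator |
    | BOOM-Explorer[78-79] | Pareto-optimal designs | Bayesian optimization and deep kernel Gaussian process modeling | Commercial EDA tools |
    | [80] | Optimal design for a specific preference | Reinforcement learning based on a microarchitecture scaling graph | Commercial EDA tools |
    | [82] | Pareto-optimal designs | Ensemble tree modeling and active learning | Commercial EDA tools |
    | IT-DSE[83] | Pareto-optimal designs | Bayesian optimization, invariant risk minimization, and Transformer | Commercial EDA tools |
    | GRL-DSE[84] | Pareto-optimal designs | Graph neural network, ensemble model, Bayesian optimization | Commercial EDA tools |
    | [86] | Pareto-optimal designs | BagGBRT and upper-confidence-bound hypervolume improvement | Simulator, McPAT, and other tools |
    | MoDSE[87-88] | Pareto-optimal designs | AdaGBRT and Pareto hypervolume improvement | Simulator, McPAT, and other tools |
    | [89] | Optimal ML accelerator design | Machine learning models, graph neural network, Bayesian optimization | Commercial EDA tools |
    | SoC-Tuner[90] | Pareto-optimal SoC designs | Design space pruning, Bayesian optimization | Commercial EDA tools |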
  • [1] The State Council. Several policies to promote the high-quality development of integrated circuit industry and software industry in the new era[EB/OL]. [2023-12-25]. https://www.gov.cn/zhengce/content/2020-08/04/content_5532370.htm (in Chinese)

    [2] Chen Yunji, Cai Yimao, Wang Yu, et al. Integrated circuit technology: Future development and key issues–review of the 347th Shuangqing Forum (Youth)[J]. SCIENTIA SINICA Informationis, 2024, 54(1): 1−15 (in Chinese)

    [3]

    Xiang Chengxiang, Yang Yongan, Penner R M. Cheating the diffraction limit: Electrodeposited nanowires patterned by photolithography[J]. Chemical Communications, 2009, 8: 859−873

    [4]

    Chaudhry A, Kumar M J. Controlling short-channel effects in deep-submicron SOI MOSFETs for improved reliability: A review[J]. IEEE Transactions on Device and Materials Reliability, 2004, 4(1): 99−109

    [5]

    Thimbleby H. Modes, WYSIWYG and the von Neumann bottleneck[C]//Proc of IEE Colloquium on Formal Methods and Human-Computer Interaction: II. London: IET, 1988: 4/1−4/5

    [6]

    Zhou Zhihua. Machine Learning[M]. Singapore: Springer Nature Singapore , 2021

    [7] Liang Yun, Zhuo Cheng, Li Yongfu. The shift-left design paradigm of EDA: Progress and challenges[J]. SCIENTIA SINICA Informationis, 2024, 54(1): 121−129 (in Chinese)

    [8] Bao Yungang, Chang Yisong, Han Yinhe, et al. Agile design of processor chips: Issues and challenges[J]. Journal of Computer Research and Development, 2021, 58(6): 1131−1145 (in Chinese)

    [9]

    Scheffer L, Lavagno L. EDA for IC System Design, Verification, and Testing[M]. FL: CRC Press, Inc, 2018

    [10]

    Wu C M, Shieh M D, Wu C H, et al. VLSI architectural design tradeoffs for sliding-window log-MAP decoders[J]. IEEE Transactions on Very Large Scale Integration Systems, 2005, 13(4): 439−447

    [11] Brown S, Vranesic Z. Fundamentals of Digital Logic with Verilog Design[M]. Translated by Xia Yuwen, Xu Yuxiao. 2nd ed. Beijing: China Machine Press, 2008 (in Chinese)

    [12]

    Rudell R L. Logic synthesis for VLSI design[R/OL]. Berkeley, California: University of California, Berkeley, 1989. [2023-12-25]. https://www2.eecs.berkeley.edu/Pubs/TechRpts/1989/1223.html

    [13]

    Sherwani N A. Algorithms for VLSI Physical Design Automation[M]. New York: Springer Science & Business Media New York, 2013

    [14]

    Celio C, Patterson D A, Asanović K. The Berkeley out-of-order machine (BOOM): An industry-competitive, synthesizable, parameterized RISC-V processor[R]. Berkeley, CA: EECS Department, University of California, Berkeley, 2015

    [15]

    Zhao J, Korpan B, Gonzalez A, et al. SonicBOOM: The 3rd generation Berkeley out-of-order machine[C]//Proc of 4th Workshop Computer Architecture Research with RISC-V. New York: ACM, 2020: 1−7

    [16]

    Asanović K, Avizienis R, Bachrach J, et al. The rocket chip generator[R]. Berkeley, CA: EECS Department, University of California, Berkeley, 2015

    [17]

    Chen Chen, Xiang Xiaoyan, Liu Chang, et al. Xuantie-910: A commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension: Industrial product[C]//Proc of ACM/IEEE Annual Int Symp on Computer Architecture. New York: ACM, 2020: 52−64

    [18] Xu Yinan, Yu Zihao, Wang Kaifan, et al. XiangShan open-source high performance RISC-V processor design and implementation[J]. Journal of Computer Research and Development, 2023, 60(3): 476−493 (in Chinese) doi: 10.7544/issn1000-1239.202221036

    [19]

    Bachrach J, Vo H, Richards B, et al. Chisel: Constructing hardware in a Scala embedded language[C]//Proc of DAC Design Automation Conf. Piscataway, NJ: IEEE, 2012: 1212−1221

    [20]

    Winston P H. Artificial Intelligence[M]. London: Addison-Wesley Longman Publishing Co. , Inc. , 1984

    [21]

    Rapp M, Amrouch H, Lin Yibo, et al. MLCAD: A survey of research in machine learning for CAD keynote paper[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2021, 41(10): 3162−3181

    [22]

    Synopsys. PrimeTime[CP/OL]. [2023-12-25]. https://www.synopsys.com/implementation-and-signoff/signoff/primetime.html

    [23]

    Nesset S R. RTL Power Estimation Flow and Its Use in Power Optimization[M]. Norway: Norwegian University of Science and Technology, 2018

    [24]

    Brooks D, Tiwari V, Martonosi M. Wattch: A framework for architectural-level power analysis and optimizations[C]//Proc of IEEE/ACM Annual Int Symp on Computer Architecture. New York: ACM, 2000: 83−94

    [25]

    Thoziyoor S, Ahn J H, Monchiero M, et al. A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies[C]//Proc of Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2008: 51−62

    [26]

    Li Sheng, Ahn J H, Strong R D, et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. New York: ACM, 2009: 469−480

    [27]

    Burger D, Austin T M. The SimpleScalar tool set, version 2.0[J]. ACM SIGARCH Computer Architecture News, 1997, 25: 13−25

    [28]

    Roelke A, Stan M R. RISC5: Implementing the RISC-V ISA in gem5[C]//Proc of the 1st Workshop on Computer Architecture Research with RISC-V. Piscataway, NJ: IEEE, 2017: 1−7

    [29]

    Binkert N L, Dreslinski R G, Hsu L R, et al. The M5 simulator: Modeling networked systems [J]. IEEE Micro, 2006, 26(4): 52−60

    [30]

    Carlson T E, Heirman W, Eeckhout L. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation[C]//Proc of Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2011: 1−12

    [31]

    Semiconductor Industry Association. Model for assessment of CMOS technologies and roadmaps (MASTAR)[EB/OL]. [2023-12-25]. https://web.archive.org/web/20130709053354/http://www.itrs.net/models.html

    [32]

    Brooks D, Bose P, Srinivasan V, et al. New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors[J]. IBM Journal of Research and Development, 2003, 47(5/6): 653−670

    [33]

    Wang Hangsheng, Zhu Xinping, Peh L S, et al. Orion: A power-performance simulator for interconnection networks[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2002: 294−305

    [34]

    Xi S L, Jacobson H, Bose P, et al. Quantifying sources of error in McPAT and potential impacts on architectural studies[C]//Proc of IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2015: 577−589

    [35]

    Lee W, Kim Y, Ryoo J H, et al. PowerTrain: A learning-based calibration of McPAT power models[C]//Proc of IEEE Int Symp on Low Power Electronics and Design. Piscataway, NJ: IEEE, 2015: 189−194

    [36]

    Tang A, Yang Y, Lee C Y et al. McPAT-PVT: Delay and power modeling framework for FinFET processor architectures under PVT variations[J]. IEEE Transactions on Very Large Scale Integration Systems, 2015, 23(9): 1616−1627

    [37]

    Guler A, Jha N K. McPAT-Monolithic: An area/power/timing architecture modeling framework for 3-D hybrid monolithic multicore systems[J]. IEEE Transactions on Very Large Scale Integration Systems, 2020, 28(10): 2146−2156

    [38]

    Ravipati D P, Van S, Victor M, et al. Performance and energy studies on NC-FinFET cache-based systems with FN-McPAT[J]. IEEE Transactions on Very Large Scale Integration Systems, 2023, 31(9): 1280−1293

    [39]

    Van den Steen S, De Pestel S, Mechri M, et al. Micro-architecture independent analytical processor performance and power modeling[C]//Proc of IEEE Int Symp on Performance Analysis of Systems and Software. Piscataway, NJ: IEEE, 2015: 32−41

    [40]

    Park Y H, Pasricha S, Kurdahi F J , et al. A multi-granularity power modeling methodology for embedded processors[J]. IEEE Transactions on Very Large Scale Integration Systems, 2010, 19(4): 668−681

    [41]

    Ansys. PowerArtist[CP/OL]. [2023-12-25]. https://www.ansys.com/products/semiconductors/ansys-powerartist

    [42]

    Mentor. PowerPro RTL low-power[CP/OL]. [2023-12-25]. https://www.mentor.com/hls-lp/powerpro-rtl-low-power/

    [43]

    Bogliolo A, Benini L, De Micheli G. Regression-based RTL power modeling[J]. ACM Transactions on Design Automation of Electronic Systems, 2000, 5(3): 337−372

    [44]

    Sunwoo D, Wu G Y, Patil N A. PrEsto: An FPGA-accelerated power estimation methodology for complex systems[C]//Proc of IEEE Int Conf on Field Programmable Logic and Applications. Piscataway, NJ: IEEE, 2010: 310−317

    [45]

    Yang Jianlei, Ma Liwei, Zhao Kang, et al. Early stage real-time SoC power estimation using RTL instrumentation[C]//Proc of IEEE/ACM Asia and South Pacific Design Automation Conf. Piscataway, NJ: IEEE, 2015: 779−784

    [46]

    Zhou Yuan, Ren Haoxing, Zhang Yanqing, et al. PRIMAL: Power inference using machine learning [C]//Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2019: 1−6

    [47]

    Kim D, Zhao J, Bachrach J, et al. Simmani: Runtime power modeling for arbitrary RTL with automatic signal selection[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2019: 1050−1062

    [48]

    Zhang Yanqing, Ren Haoxing, Khailany B. GRANNITE: Graph neural network inference for transferable power estimation[C]//Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2020: 1−6

    [49]

    Xie Zhiyao, Xu Xiaoqing, Walker M, et al. APOLLO: An automated power modeling framework for runtime power introspection in high-volume commercial microprocessors[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2021: 1−14

    [50]

    Fang Wenji, Lu Yao, Liu Shang, et al. MasterRTL: A pre-synthesis PPA estimation framework for any RTL design[C]//Proc of IEEE/ACM Int Conf on Computer Aided Design. Piscataway, NJ: IEEE, 2023: 1−9

    [51]

    Lee B C, Brooks D M. Illustrative design space studies with microarchitectural regression models[C]//Proc of IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2007: 340−351

    [52]

    Jacobson H, Buyuktosunoglu A, Bose P, et al. Abstraction and microarchitecture scaling in early-stage power modeling[C] // Proc of IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2011: 394−405

    [53]

    Bircher W L, John L K. Complete system power estimation: A trickle-down approach based on performance events[C] // Proc of IEEE Int Symp on Performance Analysis of Systems & Software. Piscataway, NJ: IEEE, 2007: 158−168

    [54]

    Walker M J, Diestelhorst S, Hansson A, et al. Accurate and stable run-time power modeling for mobile and embedded CPUs[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2017, 36(1): 106−119

    [55]

    Sagi M, Doan N A V, Rapp M, et al. A lightweight nonlinear methodology to accurately model multicore processor power[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(11): 3152−3164

    [56]

    Lebeane M, Ryoo J H , Panda R, et al. WattWatcher: Fine-grained power estimation for emerging workloads[C]//Proc of Int Symp on Computer Architecture and High Performance Computing. New York: ACM, 2015: 106−113

    [57]

    Reddy B K, Walker M J , Balsamo D et al. Empirical CPU power modelling and estimation in the gem5 simulator[C] // Proc of IEEE Int Workshop on Power and Timing Modeling, Optimization and Simulation. Piscataway, NJ: IEEE, 2017: 1−8

    [58]

    Ipek E, McKee S A, Caruana R, et al. Efficiently exploring architectural design spaces via predictive modeling[C]//Proc of ACM Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2006: 195−206

    [59]

    Ipek E, McKee S A, Singh K, et al. Efficient architectural design space exploration via predictive modeling[J]. ACM Transactions on Architecture and Code Optimization, 2008, 4(4): 1−34

    [60]

    Kumar A K A, Al-Salamin S, Amrouch H, et al. Machine learning-based microarchitecture-level power modeling of CPUs[J]. IEEE Transactions on Computers, 2023, 72(4): 941−956

    [61]

    Wilson S. Verilator [CP/OL]. [2023-12-25]. https://www.veripool.org/wiki/verilator

    [62]

    Rossi D, Conti F, Marongiu A, et al. PULP: A parallel ultra low power platform for next generation IoT applications[C]//Proc of IEEE Hot Chips Symp. Piscataway, NJ: IEEE, 2015: 1−39

    [63]

    Zhai Jianwang, Bai Chen, Zhu Binwu, et al. McPAT-Calib: A microarchitecture power modeling framework for modern CPUs[C]//Proc of IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2021: 1−9

    [64]

    Zhai Jianwang, Bai Chen, Zhu Binwu, et al. McPAT-Calib: A RISC-V BOOM microarchitecture power modeling framework[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(1): 243−256

    [65]

    Zhang Qijun, Li Shiyu, Zhou Guanglei, et al. PANDA: Architecture-level power evaluation by unifying analytical and machine learning solutions[C]//Proc of IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2021: 1−9

    [66]

    Zhai Jianwang, Cai Yici, Yu Bei. Microarchitecture power modeling via artificial neural network and transfer learning[C]//Proc of IEEE/ACM Asia and South Pacific Design Automation Conf. Piscataway, NJ: IEEE, 2023: 1−6

    [67]

    Wang Duo, Yan Mingyu, Teng Yihan, et al. A transfer learning framework for high-accurate cross-workload design space exploration of CPU[C]//Proc of IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2023: 1−9

    [68]

    Li Fuping, Wang Ying, Liu Cheng, et al. NoCeption: A fast PPA prediction framework for network-on-chips using graph neural network[C]//Proc of Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2022: 1035−1040

    [69]

    Guo Qi, Chen Tianshi, Chen Yunji, et al. Accelerating architectural simulation via statistical techniques: A survey[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016, 35(3): 433−446

    [70]

    Karkhanis T S, Smith J E. A first-order superscalar processor model[C]//Proc of IEEE/ACM Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2004: 338−349

    [71]

    Karkhanis T S, Smith J E. Automated design of application specific superscalar processors: An analytical approach[C]//Proc of IEEE/ACM Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2007: 402−411

    [72]

    Lee J, Jang H, Kim J. RPStacks: Fast and accurate processor design space exploration using representative stall-event stacks[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2014: 255−267

    [73]

    Bai Chen, Huang Jiayi, Wei Xuechao, et al. ArchExplorer: Microarchitecture exploration via bottleneck analysis[C]//Proc of Annual IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2023: 268−282

    [74]

    Dubach C, Jones T, O'Boyle M. Microarchitectural design space exploration using an architecture-centric approach[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2007: 262−271

    [75]

    Chen Tianshi, Guo Qi, Tang Ke, et al. ArchRanker: A ranking approach to design space exploration[J]. ACM SIGARCH Computer Architecture News, 2014, 42(3): 85−96

    [76]

    Freund Y, Iyer R, Schapire R E, et al. An efficient Boosting algorithm for combining preferences[J]. Journal of Machine Learning Research, 2003, 4(9): 933−969

    [77]

    Li Dandan, Yao Shuzhen, Liu Yuhang, et al. Efficient design space exploration via statistical sampling and AdaBoost learning[C]//Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2016: 1−6

    [78]

    Bai Chen, Sun Qi, Zhai Jianwang, et al. BOOM-Explorer: RISC-V BOOM microarchitecture design space exploration framework[C]//Proc of IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2021: 1−9

    [79]

    Bai Chen, Sun Qi, Zhai Jianwang, et al. BOOM-Explorer: RISC-V BOOM microarchitecture design space exploration framework[J]. ACM Transactions on Design Automation of Electronic Systems, 2024, 29(1): 1−23

    [80]

    Bai Chen, Zhai Jianwang, Ma Yuzhe, et al. Towards automated RISC-V microarchitecture design with reinforcement learning[C]//Proc of AAAI Conf on Artificial Intelligence. Menlo Park, CA: AAAI, 2024: 1−9

    [81]

    Eyerman S, Eeckhout L, Karkhanis T, et al. A mechanistic performance model for superscalar out-of-order processors[J]. ACM Transactions on Computer Systems, 2009, 27(2): 1−37

    [82]

    Zhai Jianwang, Cai Yici. Microarchitecture design space exploration via Pareto-driven active learning[J]. IEEE Transactions on Very Large Scale Integration Systems, 2023, 31(11): 1727−1739

    [83]

    Yu Ziyang, Bai Chen, Hu Shoubo, et al. IT-DSE: Invariance risk minimized transfer microarchitecture design space exploration[C]//Proc of IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2023: 1−9

    [84]

    Yi Xiaoling, Lu Jialin, Xiong Xiankui, et al. Graph representation learning for microarchitecture design space exploration[C]//Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2023: 1−6

    [85]

    Zhang Muhan, Jiang Shali, Cui Zhicheng, et al. D-VAE: A variational autoencoder for directed acyclic graphs[J]. arXiv preprint, arXiv: 1904.11088, 2019

    [86]

    Wang Duo, Yan Mingyu, Teng Yihan, et al. A high-accurate multi-objective ensemble exploration framework for design space of CPU microarchitecture[C]//Proc of the Great Lakes Symp on VLSI. New York: ACM, 2023: 379–383

    [87]

    Wang Duo, Yan Mingyu, Teng Yihan, et al. A high-accurate multi-objective exploration framework for design space of CPU[C] // Proc of ACM/IEEE Design Automation Conf. Piscataway, NJ: IEEE, 2023: 1−6

    [88]

    Wang Duo, Yan Mingyu, Teng Yihan, et al. MoDSE: A high-accurate multi-objective design space exploration framework for CPU microarchitectures[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2024, 43(5): 1525−1537

    [89]

    Esmaeilzadeh H, Ghodrati S, Kahng A B, et al. An open-source ML-based full-stack optimization framework for machine learning accelerators[J]. arXiv preprint, arXiv: 2308.12120, 2023

    [90]

    Chen Shixin, Zheng Su, Bai Chen, et al. SoC-Tuner: An importance-guided exploration framework for DNN-targeting SoC design[C] // Proc of IEEE/ACM Asia and South Pacific Design Automation Conf. Piscataway, NJ: IEEE, 2024: 1−6

    [91]

    Genc H, Kim S, Amid A, et al. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration[C] // Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2021: 769–774

    [92]

    Li Sicheng, Bai Chen, Wei Xuechao, et al. 2022 ICCAD CAD contest problem C: Microarchitecture design space exploration[C] // Proc of IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2022: 1−7

    [93]

    Bai Chen. ICCAD contest platform [EB/OL]. [2024-01-02]. http://47.93.191.38/


Publication history
  • Received: 2024-01-31
  • Revised: 2024-03-18
  • Published online: 2024-04-14
  • Issue date: 2024-05-31
