
机器学习辅助微架构功耗建模和设计空间探索综述

翟建旺, 凌梓超, 白晨, 赵康, 余备

翟建旺, 凌梓超, 白晨, 赵康, 余备. 机器学习辅助微架构功耗建模和设计空间探索综述[J]. 计算机研究与发展, 2024, 61(6): 1351-1369. DOI: 10.7544/issn1000-1239.202440074. CSTR: 32373.14.issn1000-1239.202440074
Zhai Jianwang, Ling Zichao, Bai Chen, Zhao Kang, Yu Bei. Machine Learning for Microarchitecture Power Modeling and Design Space Exploration: A Survey[J]. Journal of Computer Research and Development, 2024, 61(6): 1351-1369. DOI: 10.7544/issn1000-1239.202440074. CSTR: 32373.14.issn1000-1239.202440074

机器学习辅助微架构功耗建模和设计空间探索综述

基金项目: 国家重点研发计划项目(2022YFB2901100);香港特别行政区研究资助局(CUHK14210723);北京市自然科学基金项目(4244107)
    作者简介:

    翟建旺: 1996年生. 博士,特聘副研究员. 主要研究方向为机器学习辅助的EDA算法,包括微架构功耗建模、设计空间探索、物理设计

    凌梓超: 2000年生. 学士. 主要研究方向为计算机体系结构、功耗建模

白晨: 1998年生. 博士研究生. 主要研究方向为计算机体系结构、电子设计自动化

赵康: 1982年生. 博士,教授,博士生导师. CCF高级会员. 主要研究方向为电子设计自动化、面向FPGA的编译优化、异构计算系统

    余备: 1983年生. 博士,副教授,博士生导师. 主要研究方向为电子设计自动化、机器学习

    通讯作者:

赵康(zhaokang@bupt.edu.cn)

  • 中图分类号: TP332

Machine Learning for Microarchitecture Power Modeling and Design Space Exploration: A Survey

Funds: This work was supported by the National Key Research and Development Program of China (2022YFB2901100), the Research Grants Council of Hong Kong SAR (CUHK14210723), and the Beijing Natural Science Foundation (4244107).
    Author Bio:

    Zhai Jianwang: born in 1996. PhD, assistant professor. His main research interests include machine learning-assisted electronic design automation (EDA) algorithms, including microarchitecture power modeling, design space exploration, and physical design

    Ling Zichao: born in 2000. Bachelor. His main research interests include computer architecture and power modeling

Bai Chen: born in 1998. PhD candidate. His main research interests include computer architecture and electronic design automation

Zhao Kang: born in 1982. PhD, professor, PhD supervisor. Senior member of CCF. His main research interests include electronic design automation (EDA), compiling optimization for FPGA, and heterogeneous computing systems

    Yu Bei: born in 1983. PhD, associate professor, PhD supervisor. His main research interests include electronic design automation (EDA) and machine learning

  • 摘要:

    微架构设计是处理器开发的关键阶段,处在整个设计流程的上游,直接影响性能、功耗、成本等核心设计指标. 在过去的数十年中,新的微架构设计方案,结合半导体制造工艺的进步,使得新一代处理器能够实现更高的性能和更低的功耗、成本. 然而,随着集成电路发展至“后摩尔时代”,半导体工艺演进所带来的红利愈发有限,功耗问题已成为高能效处理器设计的主要挑战. 与此同时,现代处理器的架构愈发复杂、设计空间愈发庞大,设计人员期望进行快速精确的指标权衡以获得更理想的微架构设计. 此外,现有的层层分解的设计流程极为漫长耗时,已经难以实现全局能效最优. 因此,如何在微架构设计阶段进行精确高效的前瞻性功耗估计和探索优化成为关键问题. 为了应对这些挑战,机器学习技术被引入到微架构设计流程中,为处理器的微架构建模和优化提供了高质量方案. 首先介绍了处理器的主要设计流程、微架构设计及其面临的挑战,然后阐述了机器学习辅助集成电路设计,重点在于使用机器学习技术辅助微架构功耗建模和设计空间探索的研究进展,最后进行总结展望.

    Abstract:

Microarchitecture design is a key stage of processor development. It sits upstream in the entire design flow and directly affects core metrics such as performance, power consumption, and cost. Over the past few decades, new microarchitecture solutions, coupled with advances in semiconductor manufacturing, have enabled each new generation of processors to achieve higher performance at lower power consumption and cost. However, as integrated circuits enter the post-Moore era, the dividends from the evolution of semiconductor technology are increasingly limited, and power consumption has become a major challenge for energy-efficient processor design. Meanwhile, modern processors are becoming architecturally more complex and their design spaces ever larger, so designers expect fast and accurate tradeoffs among design metrics to obtain more desirable microarchitecture designs. Moreover, the existing stage-by-stage decomposed design flow is extremely lengthy and time-consuming, making it difficult to achieve globally optimal energy efficiency. Therefore, how to perform accurate and efficient early-stage power estimation and design space exploration at the microarchitecture design stage becomes a key issue. To tackle these challenges, machine learning has been introduced into the microarchitecture design flow, providing high-quality solutions for microarchitecture modeling and optimization. We first introduce the main design flow of processors, microarchitecture design and its major challenges, then describe machine learning-assisted integrated circuit design with a focus on research advances in using machine learning techniques to assist microarchitecture power modeling and design space exploration, and finally conclude with a summary and outlook.

  • 场景流(scene flow, SF)是定义在2个连续场景中表面点的3D运动场,是一个感知动态场景的基本工具.随着自动驾驶、人机交互等应用的大规模商业落地,感知系统需要精确感知环境中的动态运动物体[1-2],因此精确估计场景流成为近年来的研究热点.由LiDAR等3D传感器直接获得的点云数据可以得到场景中点的精确位置信息,因此点云数据被广泛应用于场景流估计任务中.点云数据仅包含3D点坐标,因此在稀疏点、边缘点处会出现特征信息不足的现象,在这些点上的匹配会更加困难,这些匹配困难点严重影响场景流估计的整体准确度.

近几年的方法都关注2个连续点云间对应的匹配关系来优化场景流的估计精度:FlowNet3D[3]在单尺度上获取匹配关系;HPLFlowNet[4]使用双边卷积层(bilateral convolutional layer, BCL),在多个尺度上联合计算匹配关系[5];PointPWC-Net[6]在多个尺度上建立用于匹配的代价体(cost volume, CV)和重投影(warping)模块. 但这些方法仅考虑了点云间的匹配关系,缺少优化匹配困难点的方式. 如图1(a)所示,图片中的点为场景的一部分,其中红色代表该点的端点误差(end point error, EPE)小于0.05 m;绿色代表该点的端点误差大于等于0.05 m且小于0.3 m;蓝色代表该点的端点误差大于等于0.3 m. 在图1(a)虚线框中,PointPWC-Net在一个局部邻域内(一个汽车整体)同时存在匹配准确的红色点和匹配困难的蓝色点. 本文提出的基于邻域一致性的点云场景流传播更新方法(neighborhood consistency propagation update method, NCPUM)根据点云相邻点的相关性,即属于源点云的场景流在足够小的邻域内很大程度上存在一致性,将局部邻域中的准确场景流传播到匹配困难点上,可以有效地减少匹配困难点场景流的误差,提升整体准确度. 图1(b)为经过NCPUM优化后的效果,可以看到虚线框内汽车上匹配困难的蓝色点消失,匹配较差的绿色点明显减少,匹配准确的红色点明显增多.

    图  1  2种方法的可视化对比
    Figure  1.  Visual comparison of the two methods

    具体来说,NCPUM假设利用点云内相邻点的相关性使场景流估计具有邻域一致性,通过置信度图传播更新提升整体场景流估计的准确度.基于该假设,NCPUM设计了置信度预测模块和场景流传播模块,对骨干网络输出的初始场景流预测置信度图,经过场景流传播模块在具有一致性的邻域内将场景流从高置信度点集向低置信度点集传播,改善邻域内匹配困难点的准确度.本文的贡献有2方面:

    1)根据场景流的邻域一致性设计了场景流传播优化方法NCPUM.该方法使用场景流在局部邻域内传播的方式,改善估计效果.NCPUM的评估结果优于之前的工作,证明了该方法的有效性.

    2)NCPUM在Flyingthings3D和KITTI数据集上的测试结果在准确度上达到国际先进水平,并更换不同的骨干网络进行了测试,验证了NCPUM对于不同的骨干网络都能明显提升其估计准确度.

    在Vedula等人[7]工作中,定义和介绍了场景流的概念,之后许多工作[8-12]在多种类型的数据集上进行场景流的估计,随着最近几年基于点云的深度学习方法[13-15]的发展,可以在点云上直接估计场景流.其中一个使用可学习的深度网络来估计点云上的场景流的方法是FlowNet3D[3],它将下采样的特征进行嵌入,得到点云间的运动信息,通过上采样方法回归得到对应点的场景流.FlowNet3D只在单一尺度上进行了特征的嵌入,单一尺度的感受野无法在大尺度和小尺度运动上同时获得精确的估计结果. HPLFlowNet[4]使用双边卷积在多个尺度上联合计算匹配度,但限于内存使用限制无法在上采样过程中细化场景流.而PointPWC-Net[6]遵循了光流估计的“由粗到细”(coarse to fine, CTF)的范式,在多个尺度的局部范围内使用PointConv[13]建立用于匹配的代价体和重投影的模块.FLOT[16]通过最优传输(optimal transport),优化源点云和目标点云的匹配关系.这些关注于匹配关系的方法得到了很好的场景流估计结果.HALFlow[17]使用注意力机制,嵌入更多的位置信息,获得更准确的场景流估计结果.

文献[3-4,6,13,16-17]都是通过匹配连续点云间的特征回归出对应源点云的场景流,在匹配困难点处没有额外的优化措施. 本文方法在源点云中根据相邻点的相关性,在邻域内改善匹配困难点的场景流,获得优于匹配方法的估计准确度.

    之前的场景流估计工作中都会注重在邻域内提取特征,根据提取到的特征来进行连续点云间的匹配[3-4,6,17-19],回归出点云间的场景流.但这只是在提取的特征上包含了邻域信息,在邻域特征信息不足的点上会出现匹配困难的情况.在同样使用邻域信息进行匹配的任务中[20-21],LiteFlowNet3[20]根据局部光流一致性,在代价体上对邻域内的点进行优化,获得了相对于匹配方法更好的光流估计精度.受该想法的启发,我们合理假设在2个连续场景中,一个合适的邻域内的点共享相同的运动模式,因此在邻域内的场景流具有一致性.NCPUM根据初始场景流显式的估计置信度,在邻域内的高置信度点向低置信度点进行传播更新.与现有方法不同的是,NCPUM的更新操作是在场景流上而非在特征上,所依赖的也不是特征上的相关或者相似,而是点云邻域内场景流的一致性.

    NCPUM从场景流邻域一致性假设出发,构建了一种对场景流在邻域内传播更新的优化方法.具体网络框架如图2所示,分别构建置信度预测模块和场景流传播模块实现NCPUM优化方法.首先是估计初始场景流的骨干网络,在得到初始场景流以及对应特征之后,送入置信度预测模块;然后在置信度预测模块中使用编码器-解码器(encoder-decoder)的网络结构,对输入的场景流进行置信度图的预测,置信度图用来表示各个场景流估计是否准确;最后在场景流传播模块中,根据预测得到的置信度图将场景流从高置信度点集向低置信度点集传播,更新低置信度点的场景流,降低匹配困难点对准确度的影响.

    图  2  网络结构图
    Figure  2.  Network structure

场景流估计任务的最终目标是估计2个连续点云之间的运动矢量,因此定义2个连续的3D点云场景:源点云 $P=\{\boldsymbol{x}_i \mid i=1,2,\ldots,n_1\}$ 和目标点云 $Q=\{\boldsymbol{y}_j \mid j=1,2,\ldots,n_2\}$,其中 $\boldsymbol{x}_i,\boldsymbol{y}_j\in\mathbb{R}^3$,且 $n_1$ 与 $n_2$ 不必相等. 源点云 $P$ 中的点运动到目标点云 $Q$ 中对应点的运动矢量场为 $F=(\boldsymbol{f}_1,\ldots,\boldsymbol{f}_{n_1})$,该运动矢量场即为最终估计的场景流. 估计的场景流是基于源点云 $P$ 的,因此场景流与源点云中的点一一对应.

    在估计初始场景流时,使用的是PointPWC-Net作为骨干网络,该方法使用2个连续的点云作为输入,使用特征金字塔的结构,在每个分辨率尺度上都进行一次源点云 P 到目标点云 Q 的重投影,之后进行匹配度代价体的计算,代价体定义了逐点的匹配程度,PointPWC-Net对代价体进行回归得到逐点的场景流.

在PointPWC-Net中构建了4个尺度的特征金字塔. 在得到4个尺度的点特征后,场景流的估计从较粗的尺度开始,遵循由粗到细的范式:估计出当前尺度的场景流后,将其上采样到更精细的尺度,用上采样的场景流对源点云进行重投影,再在当前尺度上根据重投影后的点云和目标点云估计一个相对于上一尺度场景流的残差,以完成对估计场景流的精细化. 将整个重投影过程公式化定义为:

$$P_{\mathrm{w}}=\{\boldsymbol{p}_{\mathrm{w},i}=\boldsymbol{p}_i+\boldsymbol{f}_i \mid \boldsymbol{p}_i\in P,\ \boldsymbol{f}_i\in F^{\mathrm{up}}\}_{i=1}^{n_1}, \tag{1}$$

其中 $P$ 为源点云,$P_{\mathrm{w}}$ 为重投影后的点云,$F^{\mathrm{up}}$ 为从上一个尺度上采样得到的场景流.
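
下面给出式(1)重投影操作的一个NumPy最小示例(笔者补充的示意代码,非原文实现),其中 source_points 与 upsampled_flow 分别为 $n_1\times 3$ 的源点云坐标与上采样场景流:

```python
import numpy as np

def warp_source(source_points: np.ndarray, upsampled_flow: np.ndarray) -> np.ndarray:
    """式(1):将上采样得到的场景流逐点加到源点云上,得到重投影点云 P_w."""
    assert source_points.shape == upsampled_flow.shape  # 均为 (n1, 3)
    return source_points + upsampled_flow

# 随机数据仅作演示
P = np.random.rand(8192, 3).astype(np.float32)
F_up = 0.1 * np.random.randn(8192, 3).astype(np.float32)
P_w = warp_source(P, F_up)
```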

在PointPWC-Net中,对2个点云以及对应的特征进行了代价体的构建. 假设 $\boldsymbol{g}_i\in\mathbb{R}^C$ 是源点云点 $\boldsymbol{p}_i\in P$ 的特征,$\boldsymbol{h}_j\in\mathbb{R}^C$ 是目标点云点 $\boldsymbol{q}_j\in Q$ 的特征,那么2个点之间的匹配度定义为:

$$Cost(\boldsymbol{p}_i,\boldsymbol{q}_j)=M\big(concat(\boldsymbol{g}_i,\boldsymbol{h}_j,\boldsymbol{q}_j-\boldsymbol{p}_i)\big), \tag{2}$$

其中使用多层感知机(multilayer perceptron)$M$ 对2点特征与点间相对位置串联后的向量进行学习. 在得到点对点的匹配度之后,将其组成当前尺度的代价体. PointPWC-Net根据源点云点到目标点云邻域点的相对位置对代价体加权:对于1个源点云的点 $\boldsymbol{p}_i\in P$,先得到它在目标点云 $Q$ 上的1个邻域 $N_Q(\boldsymbol{p}_i)$,再根据邻域中每个点相对源点云点的位置计算权重,加权聚合得到代价 $C$:

$$C=\sum_{\boldsymbol{q}_j\in N_Q(\boldsymbol{p}_i)} W_Q(\boldsymbol{q}_j,\boldsymbol{p}_i)\,Cost(\boldsymbol{q}_j,\boldsymbol{p}_i), \tag{3}$$
$$W_Q(\boldsymbol{q}_j,\boldsymbol{p}_i)=M(\boldsymbol{q}_j-\boldsymbol{p}_i). \tag{4}$$
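
为帮助理解式(2)~(4)的代价体构建过程,下面给出一个基于NumPy的简化示意(笔者补充,非原文实现):用随机初始化的单层线性映射近似原文中的多层感知机 $M$,邻域 $N_Q(\boldsymbol{p}_i)$ 用目标点云中距 $\boldsymbol{p}_i$ 最近的 $K$ 个点近似,特征维度与 $K$ 均为假设值:

```python
import numpy as np

rng = np.random.default_rng(0)
C_feat, K = 64, 16                       # 特征维度与邻域大小(假设值)
W_cost = rng.standard_normal(2 * C_feat + 3) * 0.01   # 用单层线性映射近似 MLP M,仅作示意
W_dir = rng.standard_normal(3) * 0.01

def mlp_cost(g_i, h_j, delta):
    """式(2):Cost(p_i, q_j) = M(concat(g_i, h_j, q_j - p_i))."""
    return np.concatenate([g_i, h_j, delta]) @ W_cost

def weighted_cost(p_i, g_i, Q, H):
    """式(3)(4):在 p_i 的目标点云邻域内按相对位置加权聚合匹配代价."""
    dist = np.linalg.norm(Q - p_i, axis=1)
    nbr = np.argsort(dist)[:K]           # 用 K 近邻近似邻域 N_Q(p_i)
    total = 0.0
    for j in nbr:
        delta = Q[j] - p_i
        w = delta @ W_dir                # 式(4):由相对位置得到权重
        total += w * mlp_cost(g_i, H[j], delta)   # 式(3):加权求和
    return total

# 随机数据演示
Q = rng.random((1024, 3)); H = rng.standard_normal((1024, C_feat))
p_i = rng.random(3); g_i = rng.standard_normal(C_feat)
print(weighted_cost(p_i, g_i, Q, H))
```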

使用PointPWC-Net估计初始场景流时,沿用了多尺度监督损失:对应估计得到4个尺度的场景流,将场景流真实值采样到同样的尺度,在对应的尺度上做不同权重 $\alpha_l$ 的2范数监督:

$$Loss_{\mathrm{sf}}=\sum_{l=l_0}^{L}\alpha_l\sum_{\boldsymbol{p}\in P}\big\|\boldsymbol{F}^l(\boldsymbol{p})-\boldsymbol{F}_{GT}^l(\boldsymbol{p})\big\|_2. \tag{5}$$
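
式(5)的多尺度监督损失可以按如下方式计算(笔者补充的NumPy示意,权重取下文训练设置中给出的 $\alpha_0\sim\alpha_3$):

```python
import numpy as np

def multiscale_loss(pred_flows, gt_flows, alphas=(0.02, 0.04, 0.08, 0.16)):
    """式(5):各尺度上以权重 alpha_l 加权的逐点 2 范数误差之和."""
    loss = 0.0
    for a, pred, gt in zip(alphas, pred_flows, gt_flows):
        loss += a * np.linalg.norm(pred - gt, axis=1).sum()
    return loss
```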

在骨干网络输出初始场景流后,经过置信度预测模块对初始场景流预测置信度图. 置信度由初始场景流相对于真实值的误差定义:误差越小,表示该点在骨干网络中估计的初始场景流越准确,置信度越高. 置信度预测模块首先使用“编码器−解码器”的结构,以初始场景流的3D矢量信息作为输入,在编码器中进行点的下采样,以扩大置信度预测模块的感受野,参考更多相邻场景流来推断置信度;然后在解码器的上采样过程中使用跳跃连接,串联编码过程中对应尺度的特征信息,为上采样提供更多精细尺度的特征,获得更精细的上采样结果,并且考虑骨干网络输出的场景流特征;最后使用sigmoid函数输出取值在0~1之间的置信度图,并将该置信度图用于之后的场景流传播模块.

置信度预测模块使用有监督的训练方式,监督信息是初始场景流与场景流真实值的2范数经二值化得到的先验分布图,该分布图为初始场景流相对于真实值的先验误差分布. 设定阈值 $\theta$,当初始场景流与真实值的2范数小于 $\theta$ 时设定为0,否则设定为1. 由此得到的分布图为场景流先验的二分类分布图,用来监督置信度预测模块的输出:

$$\boldsymbol{GT}_{\mathrm{conf}}=\begin{cases}0, & \|\boldsymbol{F}-\boldsymbol{GT}_{\mathrm{sf}}\|_2 < \theta,\\ 1, & \|\boldsymbol{F}-\boldsymbol{GT}_{\mathrm{sf}}\|_2 \geqslant \theta,\end{cases} \tag{6}$$
$$Loss_{\mathrm{conf}}=-\big(\boldsymbol{GT}_{\mathrm{conf}}\times\ln confmap+(1-\boldsymbol{GT}_{\mathrm{conf}})\times\ln(1-confmap)\big), \tag{7}$$

其中 $confmap$ 是置信度预测模块得到的置信度图,$\boldsymbol{GT}_{\mathrm{conf}}$ 是场景流先验分布图,在式(6)中由初始场景流 $\boldsymbol{F}$ 和真实值 $\boldsymbol{GT}_{\mathrm{sf}}$ 处理得到. 置信度图估计的最后一层使用sigmoid函数将输出转换为0~1之间的分布,因此可以使用二分类交叉熵(binary cross entropy, BCE)进行监督.
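
式(6)(7)的置信度监督过程可以用如下示意代码表达(笔者补充,非原文实现;二值化阈值 $\theta$ 的具体数值原文未给出,这里仅作占位假设):

```python
import numpy as np

def confidence_supervision(init_flow, gt_flow, confmap, theta=0.05, eps=1e-7):
    """式(6)(7):由初始场景流误差二值化得到先验分布图 GT_conf,并对置信度图计算 BCE 损失.
    theta 为二值化阈值(假设的占位数值),eps 用于数值稳定;损失在所有点上取平均."""
    err = np.linalg.norm(init_flow - gt_flow, axis=1)
    gt_conf = (err >= theta).astype(np.float32)          # 式(6)
    confmap = np.clip(confmap, eps, 1.0 - eps)
    bce = -(gt_conf * np.log(confmap) + (1.0 - gt_conf) * np.log(1.0 - confmap))  # 式(7)
    return bce.mean()
```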

经过场景流置信度图的预测,根据置信度将源点云划分为高置信度点集和低置信度点集,由高置信度点集向低置信度点集进行限制半径的传播. 根据邻域一致性假设,如果高置信度点与低置信度点之间的距离不大于传播半径阈值,则可以认为两点的场景流具有一致性,可以使用高置信度点的场景流更新低置信度点的场景流.

$$\boldsymbol{p}_2=KNN(\boldsymbol{p}_1),\quad \boldsymbol{p}_1,\boldsymbol{p}_2\in P, \tag{8}$$

其中 $\boldsymbol{p}_1$ 和 $\boldsymbol{p}_2$ 都属于源点云 $P$. 因为邻域一致性依赖于点云内相邻点的相关性,所以距离最近的点最有可能与 $\boldsymbol{p}_1$ 的场景流保持一致. 式(8)中 $KNN$ 为K最近邻(K-nearest neighbor)方法,用于为低置信度点 $\boldsymbol{p}_1$ 在源点云的高置信度点集中采样最近的点 $\boldsymbol{p}_2$.

$$f(\boldsymbol{p}_1)=f(\boldsymbol{p}_2),\quad \mathrm{if}\ \|\boldsymbol{p}_1-\boldsymbol{p}_2\|_2 < \beta, \tag{9}$$

其中 $\boldsymbol{p}_1$ 和 $\boldsymbol{p}_2$ 分别为低置信度点和高置信度点,$\beta$ 为传播半径阈值,当两点的距离不大于传播半径阈值时,传播更新低置信度点的场景流. 这里传播半径阈值非常重要:点云中的相邻点只有空间距离在一定阈值内才会具有相关性;在点密度足够的情况下,小邻域内点的场景流具有一致性,因此该半径阈值取不同数值会影响优化结果.

    NCPUM在优化初始场景流时,会将反传的梯度在初始场景流处截断,即训练置信度预测模块时不会影响到骨干网络PointPWC-Net.
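
结合式(8)(9),场景流传播模块的核心逻辑可以概括为如下示意实现(笔者补充,非原文代码;其中划分高/低置信度点集的阈值为假设值,依式(6)的约定,置信度图取值越接近0表示估计越准确;传播半径阈值取表4中KITTI上的3.0):

```python
import numpy as np
from scipy.spatial import cKDTree

def propagate_flow(points, flow, confmap, conf_thresh=0.5, beta=3.0):
    """式(8)(9)的示意实现:为每个匹配困难点在高置信度点集中取最近邻,
    若两点距离小于传播半径阈值 beta,则用高置信度点的场景流覆盖其场景流."""
    flow = flow.copy()
    accurate = confmap < conf_thresh      # 依式(6)的约定,取值接近0表示估计准确(高置信度)
    hard = ~accurate                      # 匹配困难(低置信度)点集
    if not accurate.any() or not hard.any():
        return flow
    tree = cKDTree(points[accurate])
    dist, idx = tree.query(points[hard], k=1)   # 式(8):在源点云的高置信度点中取最近点
    ok = dist < beta                            # 式(9):限制传播半径
    hard_idx = np.where(hard)[0][ok]
    src_idx = np.where(accurate)[0][idx[ok]]
    flow[hard_idx] = flow[src_idx]
    return flow
```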

与之前工作的训练过程[4,6,16]类似,NCPUM在合成数据集Flyingthings3D[22]上训练模型,然后在Flyingthings3D和真实场景数据集KITTI[23]上进行测试,并将测试结果与其他方法的结果在表1中进行对比. 之所以使用这样的训练策略,是因为很难在真实场景中获取场景流真实值:这里使用的KITTI数据集只有142个场景,而合成数据集有更大的数据量可供训练,如Flyingthings3D数据集有19640对点云可用于训练. 在训练之前,遵循HPLFlowNet和PointPWC-Net的方式对数据进行了预处理,预处理后的点云场景中只保留没有被遮挡的点.

表  1  NCPUM与其他方法的对比
Table  1.  Comparison of NCPUM and Other Methods
数据集            方法               EPE/m   Acc3DS/%  Acc3DR/%  Outlier3D/%
Flyingthings3D    FlowNet3D[3]       0.114   41.3      77.0      60.2
                  HPLFlowNet[4]      0.080   61.4      85.5      42.9
                  PointPWC-Net[6]    0.059   73.8      92.8      34.2
                  FLOT[16]           0.052   73.2      92.7      35.7
                  HALFlow[17]        0.049   78.5      94.7      30.8
                  NCPUM              0.060   76.1      93.9      30.7
KITTI             FlowNet3D[3]       0.177   37.4      66.8      52.7
                  HPLFlowNet[4]      0.117   47.8      77.8      41.0
                  PointPWC-Net[6]    0.069   72.8      88.8      26.5
                  FLOT[16]           0.056   75.5      90.8      24.2
                  HALFlow[17]        0.062   76.5      90.3      24.9
                  NCPUM              0.070   78.1      91.5      22.3
注:黑体数字表示最优结果.

    在接下来的内容中,将介绍NCPUM实现的细节,然后将测试结果与之前的方法进行对比,证明了NCPUM的有效性.并且我们认为Flyingthings3D数据集与KITTI数据集差异较大,将NCPUM在KITTI数据集前100对数据上进行微调,在后42对数据上进行了测试,更换不同骨干网络微调的测试结果在表3中展示,证明NCPUM基于邻域一致性假设的传播更新方法更适用于真实场景,并进行了消融实验,对传播半径阈值进行更多的对比实验.

表  3  在KITTI数据集上微调测试
Table  3.  Fine Tuning and Testing on KITTI Dataset
骨干网络            方法     EPE/m   Acc3DS/%  Acc3DR/%  Outlier3D/%
FlowNet3D[3]        w/o ft   0.173   27.6      60.9      64.9
                    w ft     0.102   33.6      70.3      44.9
                    NCPUM    0.094   38.6      74.1      37.1
PointPWC-Net[6]     w/o ft   0.069   72.8      88.8      26.5
                    w ft     0.045   82.7      95.1      25.3
                    NCPUM    0.043   87.5      96.9      24.3
FLOT[16]            w/o ft   0.056   75.5      90.8      24.2
                    w ft     0.029   89.4      96.8      18.0
                    NCPUM    0.028   89.9      97.0      17.5
注:黑体数字表示最优结果;w表示with,w/o表示without;ft表示fine-tuning(微调).

    NCPUM的训练设置与骨干网络PointPWC-Net相同,对输入的点云采样到8192个点,为其构建4层的特征金字塔,每一层的损失权重α设置为α0 = 0.02,α1 = 0.04,α2 = 0.08,α3 = 0.16,训练时的初始学习率设置为0.001,训练800个周期,每80个周期学习率减少一半.在对KITTI进行微调时,参数设置与NCPUM设置一致,只是将训练周期减少为400个,并且学习率每40个周期减少一半.
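
上述“初始学习率0.001、共800个周期、每80个周期学习率减半”的设置,可以用PyTorch自带的StepLR调度器表达. 下面是一个示意片段(笔者补充,model 为任意待训练网络的占位):

```python
import torch

model = torch.nn.Linear(3, 3)    # 占位模型,实际应为骨干网络或置信度预测模块
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.5)

for epoch in range(800):
    # ……此处省略每个 epoch 的前向、反向与 optimizer.step() ……
    scheduler.step()             # 每 80 个 epoch 学习率减半
```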

    置信度预测模块设置了3个尺度,分别为2048,512,256个点,特征通道数都为64.在经过3次下采样的特征提取之后得到了3个尺度的场景流特征,再经过3次上采样将特征传播到原尺度.前2个尺度的上采样都同时考虑了前一尺度的特征和下采样时对应的特征;后1个尺度的上采样考虑原尺度的场景流,在上采样到原尺度之后,与骨干网络输出的场景流特征串联在一起,经过2个1D卷积,再经过sigmoid得到0~1分布的置信度.

本文沿用了场景流估计任务一直以来的评价标准[3-4,6,18]. 假设 $\boldsymbol{F}$ 代表场景流估计值,$\boldsymbol{F}_{GT}$ 代表场景流真实值,各评价标准定义如下(下文给出相应的计算示例):

1)EPE3D. $\|\boldsymbol{F}-\boldsymbol{F}_{GT}\|_2$ 在点云上的逐点误差平均值.

2)Acc3DS. EPE3D $<0.05$ m 或者相对误差 $<5\%$ 的点占比.

3)Acc3DR. EPE3D $<0.1$ m 或者相对误差 $<10\%$ 的点占比.

4)Outlier3D. EPE3D $>0.3$ m 或者相对误差 $>10\%$ 的点占比.
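
按照上述定义,各评价指标可以用如下NumPy代码计算(笔者补充的示意实现,其中防止除零的小常数为实现细节上的假设):

```python
import numpy as np

def scene_flow_metrics(pred, gt, eps=1e-8):
    """按上述定义计算 EPE3D、Acc3DS、Acc3DR、Outlier3D(后三者为百分比)."""
    l2 = np.linalg.norm(pred - gt, axis=1)              # 逐点端点误差
    rel = l2 / (np.linalg.norm(gt, axis=1) + eps)       # 相对误差
    epe3d = l2.mean()
    acc3ds = np.mean((l2 < 0.05) | (rel < 0.05)) * 100
    acc3dr = np.mean((l2 < 0.1) | (rel < 0.1)) * 100
    outlier3d = np.mean((l2 > 0.3) | (rel > 0.1)) * 100
    return epe3d, acc3ds, acc3dr, outlier3d
```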

    Flyingthings3D可用于训练的数据集共有19640对点云,可用于测试的数据集有3824对点云.与FlowNet3D、 PointPWC-Net模型方法的设置相同,本文只使用深度小于35 m的点,大部分的前景物体都能包含在这个范围内.对每个点云采样得到8192个点用于训练和测试,训练设置参数在3.1节网络实现细节中进行了介绍.

表1展示了在Flyingthings3D数据集上NCPUM在2个准确度指标和离群值占比指标上均优于骨干网络PointPWC-Net,尤其是Acc3DS指标,对比PointPWC-Net有2.3%的提升;但在EPE指标上略低于PointPWC-Net. 其原因是场景流传播模块会使低置信度点的场景流与高置信度点的场景流相等:对于估计偏差大的点,会传播更新一个准确的场景流;而对于估计误差本就较小的低置信度点,传播更新操作反而带来了精度上的扰动. 但是准确度统计的是EPE3D<0.05 m或者相对误差<5%的点占比,所以本文方法能够优化估计偏差大的匹配困难点,提升准确度. 本文在表2中整理出更新点的统计信息(即图2中 $\hat{\boldsymbol{F}}$ 与 $\boldsymbol{F}$ 的比较),其中包含更新点占全部点的比例(更新点占比)、更新点中精度提升的占比(改进点占比)、更新点中精度下降的占比(扰动点占比)、精度提升的平均值(改进均值)、精度下降的平均值(扰动均值). 可以看到,有一半以上的更新点其实产生了扰动,并且扰动的均值大于改进的均值,因此在精度上NCPUM确实会产生一定的扰动,但在准确度指标和离群值占比指标上有大幅提升.

表  2  Flyingthings3D数据集更新点统计信息
Table  2.  Statistical Information of Update Points on Flyingthings3D Dataset
方法           更新点占比/%  改进点占比/%  扰动点占比/%  改进均值/m  扰动均值/m  EPE/m   Acc3DS/%  Acc3DR/%  Outlier3D/%
PointPWC-Net                                                                    0.059   73.8      92.8      34.2
NCPUM          9.73          46.98         53.02         0.003       0.004       0.060   76.1      93.9      30.7
注:黑体数字表示最优结果.

KITTI Scene Flow 2015数据集包括200个训练场景和200个测试场景. 遵循之前工作的方法,在具有真实值并且可用的142对点云上进行测试,并与之前的方法保持一致[4,6],删除了点云中高度小于0.3 m的地面点. 在KITTI数据集上,先使用在Flyingthings3D数据集上训练的模型直接进行测试而不经过任何微调,以评估NCPUM的泛化能力. 从表1中可以看到,NCPUM在Acc3DS和Acc3DR评价指标上优于PointPWC-Net,在Acc3DS指标上有5.3%的提升,在Acc3DR指标上有2.7%的提升,提升幅度大于Flyingthings3D数据集,表现出优秀的泛化性能. 不经过微调而在真实数据集上表现更加优秀的原因在于:真实数据集KITTI的点云中物体之间的距离大于Flyingthings3D数据集中物体之间的距离,NCPUM的邻域一致性假设更适用于这种数据特性,所以NCPUM在真实场景上有更加优秀的表现. 本文统计了表1中FLOT和NCPUM的场景流估计时间:FLOT每秒可以处理2.15对连续点云,而NCPUM每秒可以处理5.09对连续点云,NCPUM的处理速度约是FLOT的2.37倍. 在真实使用场景中,准确场景流在总体场景流中的占比比场景流的绝对平均误差值更有意义,拥有更多的准确场景流代表该方法能为真实应用带来更好的稳定性. NCPUM在Acc3DS和Acc3DR准确度指标上都有可观的提升,尤其在真实数据集上的Acc3DS指标超过PointPWC-Net 7.28%,超过HALFlow的最佳结果2.09%. 对比之前的方法,NCPUM的处理速度和准确度表现出更大的应用潜力.

    因为Flyingthings3D数据集和KITTI数据集存在较大的差异,直接使用在Flyingthings3D数据集上预训练的模型在KITTI数据集上测试并不能完全展示NCPUM在KITTI数据集上的性能.所以本文将KITTI数据集拆分,使用前100对场景微调NCPUM,并在剩余的42对场景上进行测试.分别将FlowNet3D,PointPWC-Net, FLOT 3种骨干网络在KITTI数据集上进行微调,然后进行NCPUM的微调,将微调后的3种骨干网络做对比.在微调之后,3种骨干网络的NCPUM可以获得更好的效果,如表3所示,微调后的NCPUM对比微调后对应的骨干网络,在4个评价标准上都有提升,与泛化能力测试结果不同的是,NCPUM在EPE指标上也有了一定的提升,我们认为,Flyingthings3D是虚拟数据集,场景中物体间的距离较小,对某一物体边缘的低置信度点传播更新,可能采样到距离较近的其他物体上的准确点,而不是同一物体中距离较远的准确点,例如图3所示,绿色点为更新场景流的低置信度点,红色的点是传播场景流的高置信度点,黄色的连线为传播更新关系.在图3(a)和图3(b)中都出现了采样到其他物体的现象;KITTI是真实数据集,物体之间的距离较大,如图3(c)所示,不容易出现采样错误的情况,只有在如图3(d)中的远端离群点上可能出现采样不准确的情况,因此KITTI相较于Flyingthings3D数据集更容易符合邻域一致性的假设.

    图  3  Flyingthings3D与KITTI替换细节对比
    Figure  3.  Comparison of Flyingthings3D and KITTI on replacement details

    因为NCPUM是基于邻域一致性假设进行构建的,因此传播半径阈值设置十分重要,不同的半径阈值设置带来的效果是不一样的,甚至不合适的半径阈值会降低NCPUM优化效果.当半径阈值设置过大时,高置信度点的场景流会传播到不具有一致性的低置信度点上,出现扰动;当半径设置过小时,只会有少部分低置信度点会被更新.数据集的不同也会影响到传播半径的设置,对比虚拟数据集,真实数据集因为物体间的距离更大,更容易设置一个合适的传播半径进行传播,这也是NCPUM泛化到真实数据集表现更好的原因.表4对2个数据集上设置的不同传播半径进行对比,NCPUM在Flyingthings3D数据集上的半径设置为0.4时达到最好效果,而在KITTI数据集上的半径设置为3.0时达到最好效果.这个数据的差异表现出在真实场景上对传播的约束更加小,传播更新可以影响到更多的点,从而带来更好的改进效果.

表  4  NCPUM在不同半径阈值下的测试
Table  4.  NCPUM Tested with Different Radius Threshold
数据集            半径阈值  EPE/m   Acc3DS/%  Acc3DR/%  Outlier3D/%
Flyingthings3D    1.0       0.062   75.2      93.5      31.7
                  0.4       0.060   76.1      93.9      30.7
                  0.2       0.060   76.0      93.8      30.7
KITTI             5.0       0.054   85.8      95.9      24.1
                  3.0       0.043   87.5      96.9      24.3
                  1.0       0.043   85.0      95.4      25.5
注:黑体数字表示最优结果.

    为了证明NCPUM方法的泛化性能,本文尝试在不同的骨干网络上进行优化.我们分别以FlowNet3D,PointPWC-Net,FLOT为骨干网络构建使用置信度预测模块和场景流传播模块,在KITTI数据集上进行微调和使用NCPUM优化方法改进.测试结果如表3所示,在对FlowNet3D, PointPWC-Net,FLOT方法使用NCPUM优化方法后,4个指标上都有明显的提升,展示了NCPUM优化方法对不同骨干网络的泛化性.

    图4中可视化了NCPUM的传播更新过程,绿色点为更新场景流的低置信度点,红色点是传播场景流的高置信度点,黄色的连线为传播更新关系.可以看到KITTI数据集中具有一致性的汽车表面会出现估计不准确的绿色低置信度点,这些点更多是位于距离激光雷达更远的低密度点和邻域信息单一的边缘点上,若只关注连续点云间匹配的方法容易在这些点上出现较大的误差,NCPUM对匹配困难点寻找一个匹配准确点进行更新,和相邻的准确点保持一致,从而提高整体的估计准确度;同时传播过程要限制传播半径阈值,避免引入扰动.

图  4  替换细节可视化
Figure  4.  Visualization of replacement details

本文提出了一种基于邻域一致性的传播更新方法NCPUM来优化场景流的估计精度. 该方法先通过置信度预测模块对骨干网络估计出的初始场景流预测置信度图,以判断场景流的准确程度;然后通过场景流传播模块在限制的传播半径内为匹配困难点寻找匹配准确点,将匹配准确点的场景流传播到匹配困难点上,以提高场景流的估计精度. NCPUM在Flyingthings3D和KITTI数据集上都体现出优于之前工作的性能,并且通过在真实数据集上的微调实验和不同传播半径的实验,展现出NCPUM在真实场景数据上有更加优秀的表现.

    作者贡献声明:郑晗负责提出模型、编写代码、实施实验过程和论文撰写;王宁负责修订和完善论文;马新柱负责数据分析和数据解释;张宏负责理论和实验设计;王智慧负责理论设计;李豪杰提供论文写作指导.

  • 图  1   处理器芯片设计流程示意图

    Figure  1.   Illustration of processor chip design flow

    图  2   RISC-V BOOM微架构示意图

    Figure  2.   Illustration of RISC-V BOOM microarchitecture

    图  3   基于机器学习的EDA出版物数量及占比统计[21]

    Figure  3.   Statistics of numbers and percentage on EDA publications based on machine learning[21]

    图  4   处理器功耗建模方法对比

    Figure  4.   Comparison of processor power modeling methods

    图  5   运行时模型转换为设计时模型[57]

    Figure  5.   Converting runtime models to design-time models[57]

    图  6   Kumar等人[60]提出的功耗建模流程

    Figure  6.   Power modeling flow proposed by Kumar et al[60]

    图  7   PowerTrain示意图[35]

    Figure  7.   Illustration of PowerTrain[35]

    图  8   McPAT-Calib框架流程图[64]

    Figure  8.   Flowchart of McPAT-Calib framework[64]

    图  9   PANDA框架示意图[64]

    Figure  9.   Illustration of PANDA framework[64]

    图  10   Zhai等人[65]提出的基于迁移学习的微架构功耗建模流程

    Figure  10.   Transfer learning-based microarchitecture power modeling flow proposed by Zhai et al. [65]

    图  11   NoCeption框架示意图[68]

    Figure  11.   Illustration of NoCeption framework [68]

    图  12   功耗建模一般流程

    Figure  12.   Common power modeling flow

    图  13   ArchRanker框架示意图[75]

    Figure  13.   Illustration of ArchRanker framework[75]

    图  14   结合统计采样和AdaBoost学习的设计空间探索方法[77]

    Figure  14.   Design space exploration methodology combining statistical sampling and AdaBoost learning[77]

    图  15   BOOM-Explorer框架示意图[79]

    Figure  15.   Illustration of BOOM-Explorer framework[79]

    图  16   基于强化学习的微架构设计空间探索[80]

    Figure  16.   RL-based microarchitecture design space exploration[80]

    图  17   IT-DSE框架示意图[83]

    Figure  17.   Illustration of IT-DSE framework[83]

    图  18   微架构DAG和VGAE的训练方法[84]

    Figure  18.   Training methods of microarchitecture DAG and VGAE [84]

    图  19   设计空间探索一般流程

    Figure  19.   The general flow of design space exploration

    表  1   机器学习辅助的微架构功耗建模方法总结

    Table  1   A Summary of Machine Learning-Assisted Methods for Microarchitecture Power Modeling

    模型/文献 使用阶段 适用范围 建模特征 建模方法 模型误差
    PowerTrain[35] 运行时 不同微架构、不同负载 PMC+硬件描述 线性回归 约2%
    WattWatcher[56] 运行时 不同微架构、不同负载 PMC+硬件描述 线性回归 平均2.67%
    文献[53] 运行时 单一微架构、不同负载 PMC 线性回归 <9%
    文献[54] 运行时 单一微架构、不同负载 PMC 线性回归 2.8%~3.8%
    文献[55] 运行时 单一微架构、不同负载 PMC 非线性回归 平均6.8%
    文献[51] 设计时 不同微架构 微架构设计参数 非线性回归 中位5.4%
    文献[52] 设计时 单一微架构、不同负载 仿真活动统计 线性回归 约2.5%
    文献[57] 设计时 单一微架构、不同负载 仿真活动统计 线性回归 平均5.9%
文献[58-59] 设计时 不同微架构 微架构设计参数 神经网络 <2%
    文献[60] 设计时 单一微架构、不同负载 外部输入信号 机器学习模型 约3.6%
McPAT-Calib[63-64] 设计时 不同微架构、不同负载 架构参数、活动统计 解析模型+机器学习 3%~6%
    PANDA[65] 设计时 不同微架构、不同负载 架构参数、活动统计 解析函数+机器学习 2%~8%
    文献[66] 设计时 不同微架构、不同负载 架构参数、活动统计 神经网络+迁移学习 平均4.4%
    TrEnDSE[67] 设计时 不同微架构、跨负载 架构参数 集成模型+迁移学习 <1%
    NoCeption[68] 设计时 不同NoC配置及拓扑 架构参数 图神经网络 约2.5%

    表  2   BOOM的微架构设计空间

    Table  2   Microarchitecture Design Space of BOOM

    模块 组件参数 描述 备选值
前端 FetchWidth 一次性可取回指令数 4, 8
FetchBufferEntry 取指缓冲条目数 8, 16, 24, 32, 35, 40
RasEntry 返回地址堆栈条目数 16, 24, 32
BranchCount 同时推测分支数 8, 12, 16, 20
ICacheWay ICache组相联数 2, 4, 8
ICacheTLB ICache地址翻译缓冲路数 8, 16, 32
ICacheFetchBytes ICache行容量 2, 4
指令解码单元 DecodeWidth 一次性最多解码指令数 1, 2, 3, 4, 5
RobEntry 重排序缓冲条目数 32, 64, 96, 128, 130
IntPhyRegister 整型寄存器数 48, 64, 80, 96, 112
FpPhyRegister 浮点型寄存器数 48, 64, 80, 96, 112
执行单元 MemIssueWidth 存储型指令发射宽度 1, 2
IntIssueWidth 整型指令发射宽度 1, 2, 3, 4, 5
FpIssueWidth 浮点型指令发射宽度 1, 2
加载存储单元 LDQEntry 加载缓冲条目数 8, 16, 24, 32
STQEntry 存储缓冲条目数 8, 16, 24, 32
DCacheWay DCache组相联数 2, 4, 8
DCacheMSHR 缺失状态处理寄存器数 2, 4, 8
DCacheTLB DCache地址翻译缓冲路数 8, 16, 32
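
表2所列的微架构设计空间可以自然地表示为“组件参数→备选值列表”的映射. 下面的示意代码(笔者补充)摘取其中部分参数并计算组合数,用以说明设计空间的规模;完整空间为全部19个参数的笛卡儿积,且实际探索中还需剔除不满足约束的非法组合:

```python
from math import prod

# 摘自表2的部分组件参数及其备选值
boom_design_space = {
    "FetchWidth":       [4, 8],
    "FetchBufferEntry": [8, 16, 24, 32, 35, 40],
    "DecodeWidth":      [1, 2, 3, 4, 5],
    "RobEntry":         [32, 64, 96, 128, 130],
    "IntIssueWidth":    [1, 2, 3, 4, 5],
    "DCacheWay":        [2, 4, 8],
}

# 仅这6个参数的组合数即达 4500 种,可见完整设计空间极为庞大
print(prod(len(v) for v in boom_design_space.values()))
```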

    表  3   BOOM处理器的不同微架构设计参数选择

    Table  3   Design Parameters Selectivity of Different Microarchitecture for BOOM Processors

    方法 微架构组件配置参数
原始两发射BOOM[14-15] {4, 16, 32, 12, 4, 8, 2, 2, 64, 80, 64, 1, 2, 1, 16, 16, 4, 2, 8}
    BOOM-Explorer[78] {4, 16, 16, 8, 2, 8, 2, 2, 32, 64, 64, 1, 3, 1, 24, 24, 8, 4, 8}
    BOOM-Explorer[79] {4, 16, 16, 8, 4, 8, 2, 2, 32, 64, 64, 1, 3, 1, 24, 24, 8, 4, 8}

    表  4   基于机器学习的微架构设计空间探索方法总结

    Table  4   A Summary of Machine Learning-Based Methods for Microarchitecture Design Space Exploration

    方法/文献 探索目标 探索方法 PPA等数据来源
    文献[51] 帕累托前沿、流水线深度以及异构性分析 设计空间采样、统计学习 Turandot仿真器、PowerTimer工具
文献[58-59] 设计空间的预测模型 使用ANN建模遍历子空间 SESC仿真器、CACTI等
    文献[73] 设计空间的预测模型 使用ANN和线性回归建模遍历子空间 仿真器、Wattch、CACTI等
    ArchRanker[75] 特定目标下最优设计 基于RankBoost排名模型遍历子空间 仿真器、Wattch、CACTI等
    文献[77] 预测模型及最优设计 基于AdaBoost.RT模型和正交阵列采样 gem5仿真器
BOOM-Explorer[78-79] 帕累托最优设计 基于贝叶斯优化和深度核高斯过程建模 商业EDA工具
    文献[80] 特定偏好下最优设计 基于微架构缩放图的强化学习 商业EDA工具
    文献[82] 帕累托最优设计 基于集成树建模和主动学习 商业EDA工具
    IT-DSE[83] 帕累托最优设计 基于贝叶斯优化、不变风险最小化和Transformer 商业EDA工具
    GRL-DSE[84] 帕累托最优设计 基于图神经网络、集成模型、贝叶斯优化 商业EDA工具
    文献[86] 帕累托最优设计 基于BagGBRT和上置信界超体积提升 仿真器、McPAT等工具
MoDSE[87-88] 帕累托最优设计 基于AdaGBRT和帕累托超体积提升 仿真器、McPAT等工具
    文献[89] ML加速器最优设计 机器学习模型、图神经网络、贝叶斯优化 商业EDA工具
    SoC-Tuner[90] SoC帕累托最优设计 设计空间剪枝、贝叶斯优化 商业EDA工具
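
表4中多数方法遵循“代理模型+采集函数”的迭代探索骨架:用已评估的设计点拟合代理模型,再据此挑选下一个送入仿真器或EDA工具评估的设计点. 下面给出一个单目标贝叶斯优化的极简示意(笔者补充,不对应表中任何一篇文献的具体实现),其中 evaluate_design 为假设的占位函数,代表调用仿真器/EDA工具获得待优化指标的过程:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def evaluate_design(x):
    """占位的评估函数:真实流程中应调用仿真器/EDA 工具,返回待最小化的指标(如功耗)."""
    return float(np.sum((x - 0.3) ** 2))

dim, n_init, n_iter, n_cand = 6, 8, 20, 512      # 将每个微架构参数归一化到 [0, 1]
X = rng.random((n_init, dim))
y = np.array([evaluate_design(x) for x in X])

for _ in range(n_iter):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)   # 高斯过程代理模型
    cand = rng.random((n_cand, dim))                            # 随机候选设计点
    mu, sigma = gp.predict(cand, return_std=True)
    imp = y.min() - mu
    z = imp / np.maximum(sigma, 1e-9)
    ei = imp * norm.cdf(z) + sigma * norm.pdf(z)                # 期望改进(EI)采集函数
    x_next = cand[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, evaluate_design(x_next))

print("best objective:", y.min())
```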
  • [1] 国务院. 新时期促进集成电路产业和软件产业高质量发展若干政策[EB/OL]. [2023-12-25]. https://www.gov.cn/zhengce/content/2020-08/04/content_5532370.htm

The State Council. Several policies to promote the high-quality development of integrated circuit industry and software industry in the new era[EB/OL]. [2023-12-25]. https://www.gov.cn/zhengce/content/2020-08/04/content_5532370.htm (in Chinese)

    [2] 陈云霁,蔡一茂,汪玉,等. 集成电路未来发展与关键问题——第347期“双清论坛(青年)”学术综述[J]. 中国科学:信息科学,2024,54(1):1−15

Chen Yunji, Cai Yimao, Wang Yu, et al. Integrated circuit technology: Future development and key issues — review of the 347th Shuangqing Forum (Youth)[J]. SCIENTIA SINICA Informationis, 2024, 54(1): 1−15 (in Chinese)

    [3]

    Xiang Chengxiang, Yang Yongan, Penner R M. Cheating the diffraction limit: Electrodeposited nanowires patterned by photolithography[J]. Chemical Communications, 2009, 8: 859−873

    [4]

    Chaudhry A, Kumar M J. Controlling short-channel effects in deep-submicron SOI MOSFETs for improved reliability: A review[J]. IEEE Transactions on Device and Materials Reliability, 2004, 4(1): 99−109

    [5]

    Thimbleby H. Modes, WYSIWYG and the von Neumann bottleneck[C]//Proc of IEE Colloquium on Formal Methods and Human-Computer Interaction: II. London: IET, 1988: 4/1−4/5

    [6]

    Zhou Zhihua. Machine Learning[M]. Singapore: Springer Nature Singapore , 2021

    [7] 梁云,卓成,李永福. EDA左移融合设计范式的发展现状、趋势与挑战[J]. 中国科学:信息科学,2024,54(1):121−129

    Liang Yun, Zhuo Cheng, Li Yongfu. The shift-left design paradigm of EDA: Progress and challenges[J]. SCIENTIA SINICA Informationis, 2024, 54(1): 121−129 (in Chinese)

    [8] 包云岗,常轶松,韩银和,等. 处理器芯片敏捷设计方法:问题与挑战[J]. 计算机研究与发展,2021,58(6):1131−1145

    Bao Yungang, Chang Yisong, Han Yinhe, et al. Agile design of processor chips: Issues and challenges[J]. Journal of Computer Research and Development, 2021, 58(6): 1131−1145 (in Chinese)

    [9]

    Scheffer L, Lavagno L. EDA for IC System Design, Verification, and Testing[M]. FL: CRC Press, Inc, 2018

    [10]

    Wu C M, Shieh M D, Wu C H, et al. VLSI architectural design tradeoffs for sliding-window log-MAP decoders[J]. IEEE Transactions on Very Large Scale Integration Systems, 2005, 13(4): 439−447

    [11] Brown S, Vranesic Z. 数字逻辑基础与Verilog设计[M]. 夏宇闻,须毓孝译. 原书第2版. 北京:机械工业出版社,2008

    Brown S, Vranesic Z. Fundamentals of Digital Logic with Verilog Design[M]. Translated by Xia Yuwen, Xu Yuxiao. 2nd. Beijing: China Machine Press, 2008 (in Chinese)

    [12]

    Rudell R L. Logic synthesis for VLSI design[R/OL]. Berkeley, California: University of California, Berkeley, 1989. [2023-12-25]. https://www2.eecs.berkeley.edu/Pubs/TechRpts/1989/1223.html

    [13]

    Sherwani N A. Algorithms for VLSI Physical Design Automation[M]. New York: Springer Science & Business Media New York, 2013

    [14]

Celio C, Patterson D A, Asanović K. The Berkeley out-of-order machine (BOOM): An industry-competitive, synthesizable, parameterized RISC-V processor[R]. Berkeley, CA: EECS Department, University of California, Berkeley, 2015

    [15]

Zhao J, Korpan B, Gonzalez A, et al. SonicBOOM: The 3rd generation Berkeley out-of-order machine[C]//Proc of the 4th Workshop on Computer Architecture Research with RISC-V. New York: ACM, 2020: 1−7

    [16]

Asanović K, Avižienis R, Bachrach J, et al. The rocket chip generator[R]. Berkeley, CA: EECS Department, University of California, Berkeley, 2015

    [17]

Chen Chen, Xiang Xiaoyan, Liu Chang, et al. Xuantie-910: A commercial multi-core 12-stage pipeline out-of-order 64-bit high performance RISC-V processor with vector extension: Industrial product[C]//Proc of ACM/IEEE Annual Int Symp on Computer Architecture. New York: ACM, 2020: 52−64

    [18] 徐易难,余子濠,王凯帆,等. 香山开源高性能RISC-V处理器设计与实现[J]. 计算机研究与发展,2023,60(3):476−493 doi: 10.7544/issn1000-1239.202221036

    Xu Yinan, Yu Zihao, Wang Kaifan, et al. XiangShan Open-source high performance RISC-V processor design and implementation[J]. Journal of Computer Research and Development, 2023, 60(3): 476−493 (in Chinese) doi: 10.7544/issn1000-1239.202221036

    [19]

    Bachrach J, Vo H, Richards B, et al. Chisel: Constructing hardware in a scala embedded language[C]//Proc of DAC Design Automation Conf. Piscataway, NJ: IEEE , 2012: 1212−1221

    [20]

    Winston P H. Artificial Intelligence[M]. London: Addison-Wesley Longman Publishing Co. , Inc. , 1984

    [21]

    Rapp M, Amrouch H, Lin Yibo, et al. MLCAD: A survey of research in machine learning for CAD keynote paper[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2021, 41(10): 3162−3181

    [22]

    Synopsys. PrimeTime[CP/OL]. [2023-12-25]. https://www.synopsys.com/implementation-and-signoff/signoff/primetime.html

    [23]

    Nesset S R. RTL Power Estimation Flow and Its Use in Power Optimization[M]. Norway: Norwegian University of Science and Technology, 2018

    [24]

    Brooks D, Tiwari V, Martonosi M. Wattch: A framework for architectural-level power analysis and optimizations[C]//Proc of IEEE/ACM Annual Int Symp on Computer Architecture. New York: ACM, 2000: 83−94

    [25]

Thoziyoor S, Ahn J H, Monchiero M, et al. A comprehensive memory modeling tool and its application to the design and analysis of future memory hierarchies[C]//Proc of Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2008: 51−62

    [26]

    Li Sheng, Ahn J H, Strong R D, et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. New York: ACM, 2009: 469−480

    [27]

Burger D, Austin T M. The SimpleScalar tool set, version 2.0[J]. ACM SIGARCH Computer Architecture News, 1997, 25: 13−25

    [28]

Roelke A, Stan M R. RISC5: Implementing the RISC-V ISA in gem5[C]//Proc of the 1st Workshop on Computer Architecture Research with RISC-V. Piscataway, NJ: IEEE, 2017: 1−7

    [29]

    Binkert N L, Dreslinski R G, Hsu L R, et al. The M5 simulator: Modeling networked systems [J]. IEEE Micro, 2006, 26(4): 52−60

    [30]

    Carlson T E, Heirman W, Eeckhout L. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation[C]//Proc of Int Conf for High Performance Computing, Networking, Storage and Analysis. Piscataway, NJ: IEEE, 2011: 1−12

    [31]

Semiconductor Industries Association. Model for assessment of CMOS technologies and roadmaps (MASTAR)[EB/OL]. [2023-12-25]. https://web.archive.org/web/20130709053354/http://www.itrs.net/models.html

    [32]

Brooks D, Bose P, Srinivasan V, et al. New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors[J]. IBM Journal of Research and Development, 2003, 47(5/6): 653−670

    [33]

    Wang Hangsheng, Zhu Xinping, Li-Shiuan P, et al. Orion: A power-performance simulator for interconnection networks[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2002: 294−305

    [34]

    Xi S L, Jacobson H, Bose P, et al. Quantifying sources of error in McPAT and potential impacts on architectural studies[C]//Proc of IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2015: 577−589

    [35]

    Lee W, Kim Y, Ryoo J H, et al. PowerTrain: A learning-based calibration of McPAT power models[C]//Proc of IEEE Int Symp on Low Power Electronics and Design. Piscataway, NJ: IEEE, 2015: 189−194

    [36]

Tang A, Yang Y, Lee C Y, et al. McPAT-PVT: Delay and power modeling framework for FinFET processor architectures under PVT variations[J]. IEEE Transactions on Very Large Scale Integration Systems, 2015, 23(9): 1616−1627

    [37]

    Guler A, Jha N K. McPAT-Monolithic: An area/power/timing architecture modeling framework for 3-D hybrid monolithic multicore systems[J]. IEEE Transactions on Very Large Scale Integration Systems, 2020, 28(10): 2146−2156

    [38]

    Ravipati D P, Van S, Victor M, et al. Performance and energy studies on NC-FinFET cache-based systems with FN-McPAT[J]. IEEE Transactions on Very Large Scale Integration Systems, 2023, 31(9): 1280−1293

    [39]

    Van den Steen S, De Pestel S, Mechri M, et al. Micro-architecture independent analytical processor performance and power modeling[C]//Proc of IEEE Int Symp on Performance Analysis of Systems and Software. Piscataway, NJ: IEEE, 2015: 32−41

    [40]

    Park Y H, Pasricha S, Kurdahi F J , et al. A multi-granularity power modeling methodology for embedded processors[J]. IEEE Transactions on Very Large Scale Integration Systems, 2010, 19(4): 668−681

    [41]

    Ansys. PowerArtist[CP/OL]. [2023-12-25]. https://www.ansys.com/products/semiconductors/ansys-powerartist

    [42]

    Mentor. PowerPro RTL low-power[CP/OL]. [2023-12-25]. https://www.mentor.com/hls-lp/powerpro-rtl-low-power/

    [43]

    Bogliolo A, Benini L, De Micheli G. Regression-based RTL power modeling[J]. ACM Transactios on Design Automation of Electronic Systems, 2000, 5(3): 337−372

    [44]

    Sunwoo D, Wu G Y, Patil N A. PrEsto: An FPGA-accelerated power estimation methodology for complex systems[C]//Proc of IEEE Int Conf on Field Programmable Logic and Applications. Piscataway, NJ: IEEE, 2010: 310−317

    [45]

    Yang Jianlei, Ma Liwei, Zhao Kang, et al. Early stage real-time SoC power estimation using RTL instrumentation[C]//Proc of IEEE/ACM Asia and South Pacific Design Automation Conf. Piscataway, NJ: IEEE, 2015: 779−784

    [46]

    Zhou Yuan, Ren Haoxing, Zhang Yanqing, et al. PRIMAL: Power inference using machine learning [C]//Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2019: 1−6

    [47]

    Kim D, Zhao J, Bachrach J, et al. Simmani: Runtime power modeling for arbitrary RTL with automatic signal selection[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2019: 1050−1062

    [48]

    Zhang Yanqing, Ren Haoxing, Khailany B. GRANNITE: Graph neural network inference for transferable power estimation[C]//Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2020: 1−6

    [49]

    Xie Zhiyao, Xu Xiaoqing, Walker M, et al. APOLLO: An automated power modeling framework for runtime power introspection in high-volume commercial microprocessors[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2021: 1−14

    [50]

    Fang Wenji, Lu Yao, Liu Shang, et al. MasterRTL: A pre-synthesis PPA estimation framework for any RTL design[C]//Proc of IEEE/ACM Int Conf on Computer Aided Design. Piscataway, NJ: IEEE, 2023: 1−9

    [51]

    Lee B C, Brooks D M. Illustrative design space studies with microarchitectural regression models[C]//Proc of IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2007: 340−351

    [52]

    Jacobson H, Buyuktosunoglu A, Bose P, et al. Abstraction and microarchitecture scaling in early-stage power modeling[C] // Proc of IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2011: 394−405

    [53]

    Bircher W L, John L K. Complete system power estimation: A trickle-down approach based on performance events[C] // Proc of IEEE Int Symp on Performance Analysis of Systems & Software. Piscataway, NJ: IEEE, 2007: 158−168

    [54]

    Walker M J, Diestelhorst S, Hansson A, et al. Accurate and stable run-time power modeling for mobile and embedded CPUs[J]. IEEE Transactios on Computer-Aided Design of Integrated Circuits and Systems, 2017, 36(1): 106−119

    [55]

    Sagi M, Doan N A V, Rapp M, et al. A lightweight nonlinear methodology to accurately model multicore processor power[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(11): 3152−3164

    [56]

    Lebeane M, Ryoo J H , Panda R, et al. WattWatcher: Fine-grained power estimation for emerging workloads[C]//Proc of Int Symp on Computer Architecture and High Performance Computing. New York: ACM, 2015: 106−113

    [57]

Reddy B K, Walker M J, Balsamo D, et al. Empirical CPU power modelling and estimation in the gem5 simulator[C]//Proc of IEEE Int Workshop on Power and Timing Modeling, Optimization and Simulation. Piscataway, NJ: IEEE, 2017: 1−8

    [58]

    Ipek E, McKee S A, Caruana R, et al. Efficiently exploring architectural design spaces via predictive modeling[C]//Proc of ACM Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2006: 195−206

    [59]

    Ipek E, McKee S A, Singh K, et al. Efficient architectural design space exploration via predictive modeling[J]. ACM Transactions on Architecture and Code Optimization, 2008, 4(4): 1−34

    [60]

Kumar A K A, Al-Salamin S, Amrouch H, et al. Machine learning-based microarchitecture-level power modeling of CPUs[J]. IEEE Transactions on Computers, 2023, 72(4): 941−956

    [61]

Snyder W. Verilator[CP/OL]. [2023-12-25]. https://www.veripool.org/wiki/verilator

    [62]

    Rossi D, Conti F, Marongiu A, et al. PULP: A parallel ultra low power platform for next generation IoT applications[C]//Proc of IEEE Hot Chips Symp. Piscataway, NJ: IEEE, 2015: 1−39

    [63]

    Zhai Jianwang, Bai Chen, Zhu Binwu, et al. McPAT-Calib: A microarchitecture power modeling framework for modern CPUs[C]//Proc of IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2021: 1−9

    [64]

    Zhai Jianwang, Bai Chen, Zhu Binwu, et al. McPAT-Calib: A RISC-V BOOM microarchitecture power modeling framework[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(1): 243−256

    [65]

    Zhang Qijun, Li Shiyu, Zhou Guanglei, et al. PANDA: Architecture-level power evaluation by unifying analytical and machine learning solutions[C]//Proc of IEEE/ACM Int Conf on Computer Aided Design. Piscataway, NJ: IEEE, 2021: 1−9

    [66]

    Zhai Jianwang, Cai Yici, Yu Bei. Microarchitecture power modeling via artificial neural network and transfer learning[C]//Proc of IEEE/ACM Asia and South Pacific Design Automation Conf. Piscataway, NJ: IEEE, 2023: 1−6

    [67]

    Wang Duo, Yan Mingyu, Teng Yihan, et al. A Transfer learning framework for high-accurate cross-workload design space exploration of CPU[C]//Proc of IEEE/ACM Int Conf on Computer Aided Design. Piscataway, NJ: IEEE, 2023: 1−9

    [68]

Li Fuping, Wang Ying, Liu Cheng, et al. NoCeption: A fast PPA prediction framework for network-on-chips using graph neural network[C]//Proc of Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2022: 1035−1040

    [69]

    Guo Qi, Chen Tianshi, Chen Yunji, et al. Accelerating architectural simulation via statistical techniques: A survey[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016, 35(3): 433−446

    [70]

    Karkhanis T S, Smith J E. A first-order superscalar processor model[C]//Proc of IEEE/ACM Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2004: 338−349

    [71]

    Karkhanis T S, Smith J E. Automated design of application specific superscalar processors: An analytical approach[C]//Proc of IEEE/ACM Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2007: 402−411

    [72]

    Lee J, Jang H, Kim J. RPStacks: Fast and accurate processor design space exploration using representative stall-event stacks[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2014: 255−267

    [73]

    Bai Chen, Huang Jiayi, Wei Xuechao, et al. ArchExplorer: Microarchitecture exploration via bottleneck analysis[C]//Proc of Annual IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2023: 268−282

    [74]

    Dubach C, Jones T, O'Boyle M. Microarchitectural design space exploration using an architecture-centric approach[C]//Proc of IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2007: 262−271

    [75]

    Chen Tianshi, Guo Qi, Tang Ke, et al. ArchRanker: A ranking approach to design space exploration[J]. ACM SIGARCH Computer Architecture News, 2014, 42(3): 85−96

    [76]

    Freund Y, Iyer R, Schapire R E, et al. An efficient Boosting algorithm for combining preferences[J]. Journal of Machine Learning Research, 2003, 4(9): 933−969

    [77]

    Li Dandan, Yao Shuzhen, Liu Yuhang, et al. Efficient design space exploration via statistical sampling and AdaBoost learning[C]//Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2016: 1−6

    [78]

    Bai Chen, Sun Qi, Zhai Jianwang, et al. BOOM-Explorer: RISC-V BOOM microarchitecture design space exploration framework[C]//Proc of IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2021: 1−9

    [79]

    Bai Chen, Sun Qi, Zhai Jianwang, et al. BOOM-Explorer: RISC-V BOOM microarchitecture design space exploration framework[J]. ACM Transactions on Design Automation of Electronic Systems, 2024, 29(1): 1−23

    [80]

    Bai Chen, Zhai Jianwang, Ma Yuzhe, et al. Towards automated RISC-V microarchitecture design with reinforcement learning[C]//Proc of AAAI Conf on Artificial Intelligence. Menlo, CA: AAAI, 2024: 1−9

    [81]

    Eyerman S, Eeckhout L, Karkhanis T, et al. A mechanistic performance model for superscalar out-of-order processors[J]. ACM Transactions on Computer Systems, 2009, 27(2): 1−37

    [82]

    Zhai Jianwang, Cai Yici. Microarchitecture design space exploration via Pareto-driven active learning[J]. IEEE Transactions on Very Large Scale Integration Systems, 2023, 31(11): 1727−1739

    [83]

    Yu Ziyang, Bai Chen, Hu Shoubo, et al. IT-DSE: Invariance risk minimized transfer microarchitecture design space exploration[C]//Proc of IEEE/ACM Int Conf on Computer Aided Design. Piscataway, NJ: IEEE, 2023: 1−9

    [84]

    Yi Xiaoling, Lu Jialin, Xiong Xiankui, et al. Graph representation learning for microarchitecture design space exploration[C]//Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2023: 1−6

    [85]

    Zhang Muhan, Jiang Shali, Cui Zhicheng, et al. D-VAE: A variational autoencoder for directed acyclic graphs[J]. arXiv preprint, arXiv: 1904.11088, 2019

    [86]

    Wang Duo, Yan Mingyu, Teng Yihan, et al. A high-accurate multi-objective ensemble exploration framework for design space of CPU microarchitecture[C]//Proc of the Great Lakes Symp on VLSI. New York: ACM, 2023: 379–383

    [87]

    Wang Duo, Yan Mingyu, Teng Yihan, et al. A high-accurate multi-objective exploration framework for design space of CPU[C] // Proc of ACM/IEEE Design Automation Conf. Piscataway, NJ: IEEE, 2023: 1−6

    [88]

    Wang Duo, Yan Mingyu, Teng Yihan, et al. MoDSE: A high-accurate multi-objective design space exploration framework for CPU microarchitectures[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and System, 2024, 43(5): 1525−1537

    [89]

    Esmaeilzadeh H, Ghodrati S, Kahng A B, et al. An Open-source ML-based full-stack optimization framework for machine learning accelerators[J]. arXiv preprint, arXiv: 2308.12120, 2023

    [90]

    Chen Shixin, Zheng Su, Bai Chen, et al. SoC-Tuner: An importance-guided exploration framework for DNN-targeting SoC design[C] // Proc of IEEE/ACM Asian and South Pacific Design Automation Conf. Piscataway, NJ: IEEE, 2024: 1−6

    [91]

    Genc H, Kim S, Amid A, et al. Gemmini: Enabling systematic deep-learning architecture evaluation via full-stack integration[C] // Proc of ACM/IEEE Design Automation Conf. New York: ACM, 2021: 769–774

    [92]

    Li Sicheng, Bai Chen, Wei Xuechao, et al. 2022 ICCAD CAD contest problem C: Microarchitecture design space exploration[C] // Proc of IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2022: 1−7

    [93]

Bai Chen. ICCAD contest platform[EB/OL]. [2024-01-02]. http://47.93.191.38/


出版历程
  • 收稿日期:  2024-01-31
  • 修回日期:  2024-03-18
  • 网络出版日期:  2024-04-14
  • 刊出日期:  2024-05-31
