  • 中国精品科技期刊
  • CCF推荐A类中文期刊
  • 计算领域高质量科技期刊T1类

面向处理器微架构设计空间探索的加速方法综述

王铎, 刘景磊, 严明玉, 滕亦涵, 韩登科, 叶笑春, 范东睿

王铎, 刘景磊, 严明玉, 滕亦涵, 韩登科, 叶笑春, 范东睿. 面向处理器微架构设计空间探索的加速方法综述[J]. 计算机研究与发展, 2025, 62(1): 22-57. DOI: 10.7544/issn1000-1239.202330348
Wang Duo, Liu Jinglei, Yan Mingyu, Teng Yihan, Han Dengke, Ye Xiaochun, Fan Dongrui. Acceleration Methods for Processor Microarchitecture Design Space Exploration: A Survey[J]. Journal of Computer Research and Development, 2025, 62(1): 22-57. DOI: 10.7544/issn1000-1239.202330348
王铎, 刘景磊, 严明玉, 滕亦涵, 韩登科, 叶笑春, 范东睿. 面向处理器微架构设计空间探索的加速方法综述[J]. 计算机研究与发展, 2025, 62(1): 22-57. CSTR: 32373.14.issn1000-1239.202330348
Wang Duo, Liu Jinglei, Yan Mingyu, Teng Yihan, Han Dengke, Ye Xiaochun, Fan Dongrui. Acceleration Methods for Processor Microarchitecture Design Space Exploration: A Survey[J]. Journal of Computer Research and Development, 2025, 62(1): 22-57. CSTR: 32373.14.issn1000-1239.202330348


基金项目: 国家自然科学基金项目(62202451);中国科学院国际伙伴计划项目(171111KYSB20200002);中国科学院稳定支持基础研究领域青年团队计划项目(YSBR-029);中国科学院青年创新促进会项目(Y2021039);中科院计算所-中国移动研究院联合创新平台项目
    作者简介:

    王铎: 1995年生. 博士. CCF学生会员. 主要研究方向为处理器设计空间探索、计算机体系结构

    刘景磊: 1982年生. 硕士,高级工程师. 主要研究方向为算力网络、计算机体系结构

    严明玉: 1990年生. 博士,副研究员. CCF会员. 主要研究方向为基于图的硬件加速器、数据流架构

    滕亦涵: 2000年生. 硕士. 主要研究方向为基于图的硬件加速器和高吞吐量计算体系结构

    韩登科: 1998年生. 硕士研究生. 主要研究方向为基于图的硬件加速器和高吞吐量计算体系结构

    叶笑春: 1981年生. 博士,研究员. CCF会员. 主要研究方向为软件仿真、算法并行优化、高性能计算机架构

    范东睿: 1979年生. 博士,研究员. CCF杰出会员. 主要研究方向为众核处理器设计、高通量处理器设计、低功耗微架构

    通讯作者:

    严明玉(yanmingyu@ict.ac.cn)

  • 中图分类号: TP302

Acceleration Methods for Processor Microarchitecture Design Space Exploration: A Survey

Funds: This work was supported by the National Natural Science Foundation of China (62202451), the International Partnership Program of Chinese Academy of Sciences (171111KYSB20200002), the CAS Project for Young Scientists in Basic Research (YSBR-029), the CAS Project for Youth Innovation Promotion Association (Y2021039), and the Institute of Computing Technology, Chinese Academy of Sciences-China Mobile Communications Group Co., Ltd. Joint Institute.
    Author Bio:

    Wang Duo: born in 1995. PhD. Student member of CCF. His main research interests include processor design space exploration and computer architecture

    Liu Jinglei: born in 1982. Master, senior engineer. His main research interests include computility network and computer architecture

    Yan Mingyu: born in 1990. PhD, associate professor. Member of CCF. His main research interests include graph-based hardware accelerators and dataflow architecture

    Teng Yihan: born in 2000. Master. His main research interests include graph-based hardware accelerators and high-throughput computer architecture

    Han Dengke: born in 1998. Master candidate. His main research interests include graph-based hardware accelerators and high-throughput computer architecture

    Ye Xiaochun: born in 1981. PhD, professor. Member of CCF. His main research interests include software simulation, algorithm paralleling and optimizing, and architecture for high performance computer

    Fan Dongrui: born in 1979. PhD, professor. Distinguished member of CCF. His main research interests include manycore processor design, high throughput processor design, and low power microarchitecture

  • 摘要:

    中央处理器是目前最重要的算力基础设施. 为了最大化收益,架构师在设计处理器微架构时需要权衡性能、功耗、面积等多个目标. 但处理器运行负载的指令多,单个微架构设计点的评估耗时从10 min到数十小时不等. 加之微架构设计空间巨大,全设计空间暴力搜索难以实现. 近些年来许多机器学习辅助的设计空间探索加速方法被提出,以减少需要探索的设计空间或加速设计点的评估,但缺少对加速方法的全面调研和系统分类的综述. 对处理器微架构设计空间探索的加速方法进行系统总结及分类,包含软件设计空间的负载选择、负载指令的部分模拟、设计点选择、模拟工具、性能模型5类加速方法. 对比了各加速方法内文献的异同,覆盖了从软件选择到硬件设计的完整探索流程. 最后对该领域的前沿研究方向进行了总结,并放眼于未来的发展趋势.

    Abstract:

    Central processing unit is the most important computing infrastructure nowadays. To maximize the profit, architects design the processor microarchitecture by trading off multiple objectives including performance, power, and area. However, because of the tremendous number of instructions in the workloads running on the processors, the evaluation of an individual microarchitecture design point costs minutes to hours. Furthermore, the design space of the microarchitecture is huge, which makes the exhaustive exploration of the full design space unrealistic. Therefore, many machine-learning-assisted design space exploration acceleration methods are proposed to reduce the size of the evaluated design space or accelerate the evaluation of a design point. However, a comprehensive survey summarizing and systematically classifying recent acceleration methods is missing. This survey paper systematically summarizes and classifies five kinds of acceleration methods for the design space exploration of the processor microarchitecture, including the workload selection of software design space, the partial simulation of workload instructions, the design point selection, the simulation tools, and the performance models. This paper systematically compares the similarities and differences between papers in the acceleration methods, and covers the complete exploration process from the software workload selection to the hardware microarchitecture design. Finally, the research direction is summarized, and the future development trend is discussed.

    随着信息时代的发展,人们被越来越多的信息数据包围. 为了从海量的信息数据中提取出有用信息并为企业带来效益,推荐算法被广泛应用于各大企业的在线服务中[1-7]. 推荐系统旨在通过历史交互数据对用户和项的表征进行建模,发现隐藏在数据背后的模式和规律,进而为决策提供支持和指导[8-18]. 然而,传统的推荐系统只关注用户与项在单一域的交互,这相对于用户-项交互关系的总数量来说是相当小的,意味着数据稀疏性仍然是一个需要克服的问题[19-23]. 同时,对于进入系统的新用户和新项来说,缺乏历史交互数据造成的数据稀疏也是一个严重问题,也就是所谓的冷启动问题[24-32].

    为解决数据稀疏问题并提高推荐准确性,跨域推荐(cross-domain recommendation,CDR)方法被提出[33-34]. CDR利用其他域的相关信息来协助目标域的预测任务. 例如,喜欢喜剧电影而不喜欢爱情电影、喜欢笑话集而不喜欢爱情小说的用户,其个人表征反映了他们对喜剧项的偏好和对爱情项的厌恶. 现有的CDR方法通常通过学习不同域之间重叠用户或项的潜在表征,再结合来自不同域的共享信息来作为信息传递的桥梁.

    尽管传统的CDR方法已取得一定的研究进展,但仍然存在一定局限性. 如现有的非图的CDR方法忽略了用户-项交互关系的高阶隐含特征和用户-项交互图的高阶结构特征,导致不能完全捕捉到用户-项交互的复杂性[35]. 这种局限性导致推荐效果有待改善. 具体来说,传统方法只能隐式地捕捉协同信号(即使用用户-项交互信息作为监督信号),可以看作是利用一跳邻居的交互信息来进行用户的表征学习. 而将用户与项交互信息显式建模成交互图(即拓扑结构),利用图神经网络在交互图上提取出来的高阶交互信息可以自然、显式地编码关键的协同信号. 利用图神经网络提取用户-项交互信息时,每个节点(用户或项)不仅能够与一跳邻居进行交互,还能通过图结构与间接相连的用户或项(多跳邻居)进行交互,因此得到的高阶交互信息能够包含更多的上下文关系和更丰富的特征,从而提高推荐性能. 因此,设计能够捕捉高阶特征的新方法对于提高跨域推荐的准确性至关重要.

    为了从用户-项交互图中捕捉高阶信息,图卷积网络(graph convolutional network,GCN)已被广泛应用于推荐系统[36-37]. GCN使用初始属性或结构特征初始化节点表示,通过递归聚合更新每个节点,最后根据下游任务读出节点或图的最终表示[38]. 基于GCN的推荐模型通常将用户与项的交互视为用户与项的2-部图,并在图中传播信息和聚合邻近节点的特征,从而获得用户和项的高质量特征嵌入. 最近,一些研究工作通过使用GCN来实现跨域推荐任务[39-41]. 但这些工作在跨域特征提取上为每一个用户交互序列构建交互子图或在每个域上单独构建域的子图,并没有构建一个统一的不同域间用户-项交互图. 由于非活跃用户通常交互项较少,只依靠用户子图不足以生成高质量的跨域表示从而限制了推荐系统的偏好表达能力,而分开建模域的子图无法提取到丰富的跨域特征. 建模一个统一的用户-项交互图有助于提取丰富的跨域特征,提高跨域推荐性能. 此外,现有的基于图的CDR方法也没有考虑到基于图卷积的方法普遍面临的过平滑问题.

    针对上述问题,我们提出了一个新框架,称为图卷积宽度跨域推荐系统(graph convolutional broad cross-domain recommender system,GBCD). 该推荐系统利用GCN获取多个域内的高阶相似性和结构特征,从而进一步提高推荐性能,并缓解上述问题.

    本文在建立模型的过程中必须应对2个主要挑战. 第1个挑战是构建不同域之间的用户-项交互图;第2个挑战是制定有效的策略,在不同域中通过GCN提取高阶信息,构成高质量的用户-项跨域嵌入向量. 为了应对这2个挑战,本文提出了一种基于多部图概念的新方法. 具体地说,我们开发了一个(D+1)-部图,该图建立了多个域的项和重叠用户之间的关系,其中重叠用户作为传递信息的桥梁,如图1所示. 在同一域内,类似的项也被链接起来. 然后,使用GCN来聚合邻近节点的关系,并提取用户和项的特征.

    图  1  所提出的(D+1)-部图示意图
    Figure  1.  Illustration of the proposed (D+1)-partite graph

    针对基于GCN的方法普遍面临的过度平滑问题,即由于邻近节点信息的过度聚合,模型的鉴别性能降低,我们引入了宽度学习系统(broad learning system,BLS)[42]作为非线性近似器. BLS可以根据任何连续概率分布使用随机隐藏层权重将原始样本映射到一个具有区分度的特征空间. 通过随机权重向模型中引入随机噪声,可以有效地增强模型的鲁棒性,进而缓解过度平滑问题.

    在GBCD中,我们遵循了大多数GCN推荐模型的思路,摒弃了对特征聚合帮助不大的非线性激活部分. 但与简化图卷积推荐系统(simplifying and powering graph convolution network for recommendation,L-GCN)[43]不同的是,我们没有放弃权重矩阵的训练过程,实现了输入节点特征的降维. 在模型训练过程中,我们将每个GCN的结果输入BLS进行评分预测. 由于GCN网络的训练易受噪声(例如不可靠的交互等)的影响,我们提出了一种新的面向任务的优化损失函数. 该损失函数根据最终推荐任务的BLS输出性能反馈训练GCN网络. 通过这种方法,可有效地训练GBCD并提高其在推荐任务中的性能.

    本文的主要贡献有3个方面:1)专注于探索如何从多个域学习高阶特征,创新性地将不同域的用户-项交互信息构建成(D+1)-部图. 2)提出了一种新的模型 GBCD,它是一种基于图神经网络的宽度跨域推荐系统. 此外,还设计了一种新的面向任务的损失函数来训练GBCD. 3)在2个大规模真实数据集上对GBCD进行了综合实验评估,结果表明GBCD 显著提高了推荐性能.

    CDR方法已被提出作为解决推荐系统中冷启动和数据稀疏性挑战的一种方案. 多年来,CDR的各种变体被开发出来,每一种都有其独特的特点和局限性. 例如,集体矩阵分解(collective matrix factorization,CMF)[44]假设存在一个跨所有域共享的全局嵌入矩阵,并同时从多个域分解该矩阵. 在低秩和稀疏的跨域推荐(low-rank and sparse cross-domain recommendation,LSCD)[2]中,对每个域分别提取用户和项的潜在特征矩阵,而不是将每个域的评分矩阵分别分解为3个低维矩阵. 此外,用户的特性被自适应地分为共享组件和域特定组件. 近年来,深度学习模型也被引入CDR中. 例如,文献[3]提出一种新的自动编码器框架,它可以跨域传输和融合信息,以做出更准确的评分预测. Zhu等人[45]提出一个基于矩阵分解模型和全连接深度神经网络的跨域和跨系统推荐的深度框架. 嵌入映射跨域推荐系统(cross-domain recommendation: an embedding and mapping approach,EMCDR)[46]在每个域中利用隐因子模型学习用户和项特征,并在不同域间将数据从丰富域映射到稀疏域实现跨域推荐. 用户偏好个性化迁移(personalized transfer of user preferences,PTUP)推荐系统[27]使用元网络为每个用户生成1个个性化的信息桥接函数,进而为每个用户学习个性化跨域表示. 同时,Li等人[28]提出了一种新的对抗学习方法,该方法将从不同域中生成的用户嵌入向量统一为每个用户的1个全局用户表示来进行跨域推荐. Cao等人[29]通过信息瓶颈原理建模领域间去偏共享信息来实现跨域推荐. 而解耦跨域推荐系统(disentangled representations for cross-domain recommendation,DisenCDR)[47]通过解耦领域共享和领域特定信息,并利用互信息规则来增强跨域推荐性能. Xu等人[48]通过双重嵌入结构、自适应的传递矩阵和注意力机制,有效地处理特征维度和潜在空间的异质性来实现跨域推荐. Xie等人[26]通过构建多样化偏好网络和域内域间的对比学习任务来解决跨域推荐中的数据偏差问题.

    近年来,研究人员一直在探索利用图神经网络提取用户-项交互图中的特征,以更好地预测用户的偏好. 其中一种方法是基于图卷积的矩阵补全(graph convolutional matrix completion,GC-MC)[49],该方法在编码交互特征时,通过GCN来利用用户和项之间的连接. 另一种方法是将GCN集成到嵌入表征学习过程中的框架——神经图协同滤波(neural graph collaborative filtering,NGCF)[50]. NGCF覆盖多个嵌入传播层,通过传播层捕获用户和项之间高阶连接的协同信息. Chen等人[51]去掉非线性激活函数,并使用残差学习方法来解释连接各层输出的原因. 为简化NGCF,L-GCN[43]删除了对协同滤波没有正面作用的激活和转换函数等操作. 此外,还有一种新的跨域推荐的双向迁移学习方法被提出,即基于图协同滤波网络的双向转移(bi-directional transfer graph collaborative filtering networks,BiTGCF)模型[52]. BiTGCF不仅通过一个新的特征传播层建模单域用户-项图中的高阶连通性,还利用公共用户作为桥梁实现2个域之间知识的双向转移.

    2017年,BLS[42]作为一种新型的浅层神经网络模型被提出. 类似于深度神经网络,BLS可以近似逼近非线性函数,并已有严格的分析论证[53]. BLS被设计为一个浅层的扁平网络,其中原始输入数据通过连续的概率分布映射到特征节点中,然后在宽度扩展中用增强节点进行增强. 这种设计可以实现快速的训练过程,因为只需要使用伪逆算法训练从隐藏层到输出层的权重. 因此,与基于深度神经网络的模型相比,BLS不需要大量训练时间,而且由于其存储的参数数量较少,更适用于大规模数据集.

    在本节中,我们将详细描述所提出的GBCD,如图2所示. GBCD的目标是为多个领域中的重叠用户进行推荐,其关键思路是从(D+1)-部图中提取潜在特征,该图利用源域和目标域的信息构造. 利用(D+1)-部图上的多图卷积网络(MGCN),生成捕获相关信息的特征向量. 进一步地,为了优化所获得的特征向量并消除相关噪声,我们利用BLS从数据中分析和提取有价值的特征. 表1汇总了本文中出现的主要符号.

    图  2  图卷积宽度跨域推荐系统(GBCD)的整体框架
    Figure  2.  An overall framework of graph convolutional broad cross-domain recommender system (GBCD)
    表  1  主要符号描述表
    Table  1.  Description Table of the Main Notations
    符号 描述
    $\mathcal{U}$,$\mathcal{V}^{d}$,$\mathcal{V}=\{\mathcal{V}^{1},\mathcal{V}^{2},…,\mathcal{V}^{D}\}$ 用户集;第$d$个项域中的项集;$D$个项域组成的项节点集
    $\boldsymbol{R}^{d}\in\mathbb{R}^{|\mathcal{U}|\times|\mathcal{V}^{d}|}$ 第$d$个域中的用户-项评分矩阵
    $\mathcal{G}^{D+1}=\{\mathcal{U},\mathcal{V},\mathcal{E}\}$ $(D+1)$-部图
    $\mathcal{E}=\{\mathcal{E}^{1},\mathcal{E}^{2},…,\mathcal{E}^{D}\}$ 边集,每个$\mathcal{E}^{d}$都是连接$\mathcal{U}$与$\mathcal{V}^{d}$之间节点的边
    $\boldsymbol{A}$ 多域用户-项的加权邻接矩阵
    $\hat{\boldsymbol{A}}$ 基于$\boldsymbol{A}$增加自连接的邻接矩阵
    $\hat{\boldsymbol{D}}$ $\hat{\boldsymbol{A}}$的度矩阵
    $\boldsymbol{e}_{v^{d}}^{u}=[\boldsymbol{U}|\boldsymbol{V}^{d}]\in\mathbb{R}^{1\times 2N}$ 跨域协同滤波嵌入向量
    $\boldsymbol{E}^{0}$,$\boldsymbol{E}^{k}$ $\boldsymbol{E}^{0}$为$(D+1)$-部图的特征矩阵,$\boldsymbol{E}^{k}$为第$k$层的特征矩阵
    $\boldsymbol{E}$ 跨域协同矩阵,由跨域协同滤波嵌入向量拼接得到
    $\mathcal{W}^{\mathrm{g}}=\{\boldsymbol{W}^{1},\boldsymbol{W}^{2},…,\boldsymbol{W}^{k}\}$ MGCN模块的权重参数集合
    $\mathcal{W}^{\mathrm{b}}=\{\boldsymbol{W}_{zj},\boldsymbol{W}_{hj},\boldsymbol{W}^{y},\boldsymbol{\beta}_{zj},\boldsymbol{\beta}_{hj}\}$ BLS模块的权重参数集合
    $\boldsymbol{Z}^{m}$,$\boldsymbol{Z}_{j}$ 映射特征矩阵,第$j$个映射特征矩阵节点
    $\phi_{j}$,$\xi_{j}$ 第$j$个非线性映射特征映射函数,第$j$个非线性特征增强映射函数
    $\boldsymbol{H}^{h}$,$\boldsymbol{H}_{j}$ 特征增强矩阵,第$j$个特征增强层节点
    $\hat{\boldsymbol{Y}}$ BLS输出层的输出矩阵
    $\hat{\boldsymbol{r}}_{v^{d}}^{u}$ 用户$u$与项$v^{d}$的预测向量

    GBCD利用了$(D+1)$-部图,使用公共用户作为桥梁来连接跨不同域的项. 这种方法可以实现不同域之间的间接联系,并便于在每个域中提取潜在的协同滤波嵌入向量. 假设在由$D$个项域组成的跨域推荐任务中,$\mathcal{U}$为用户集,$\mathcal{V}^{d}$为第$d$个项域中的项集,$\boldsymbol{R}^{d}\in\mathbb{R}^{|\mathcal{U}|\times|\mathcal{V}^{d}|}$为第$d$个项域中的用户-项评分矩阵,$\mathcal{G}^{D+1}=\{\mathcal{U},\mathcal{V},\mathcal{E}\}$表示$(D+1)$-部图. 在$(D+1)$-部图中,$\mathcal{V}=\{\mathcal{V}^{1},\mathcal{V}^{2},…,\mathcal{V}^{D}\}$表示$D$个项域组成的项节点集,每个$\mathcal{V}^{d}$对应于第$d$个项域中的项集;$\mathcal{E}=\{\mathcal{E}^{1},\mathcal{E}^{2},…,\mathcal{E}^{D}\}$表示边集,每个$\mathcal{E}^{d}$都是连接$\mathcal{U}$与$\mathcal{V}^{d}$之间节点的边,即第$d$个域中的用户-项交互. 每条边的权重取决于用户对相应项的评分. 对于$(D+1)$-部图,表示多域用户-项评分信息的加权邻接矩阵可以构造为

    $\boldsymbol{A}=\begin{pmatrix}\boldsymbol{0} & (\boldsymbol{R}^{1},…,\boldsymbol{R}^{D})^{\mathrm{T}}\\ (\boldsymbol{R}^{1},…,\boldsymbol{R}^{D}) & \boldsymbol{0}\end{pmatrix}$. (1)
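    式(1)的分块构造可以用如下NumPy代码作一个最小示意(示意性实现,并非论文的官方代码;其中的评分矩阵取值为假设的玩具数据,节点顺序取先项节点、后用户节点,与式(1)的分块布局一致):

```python
import numpy as np

def build_adjacency(ratings):
    """由各域评分矩阵 R^d (形状 |U| x |V^d|) 构造式(1)的加权邻接矩阵 A.
    节点顺序为先所有项节点、后所有用户节点, 与式(1)的分块布局一致."""
    R = np.concatenate(ratings, axis=1)      # 横向拼接得到 (R^1,…,R^D), 形状 |U| x |V|
    n_u, n_v = R.shape
    A = np.zeros((n_v + n_u, n_v + n_u))
    A[:n_v, n_v:] = R.T                      # 右上块: (R^1,…,R^D)^T
    A[n_v:, :n_v] = R                        # 左下块: (R^1,…,R^D)
    return A

# 假设2个域、3个用户: 域1有2个项, 域2有1个项
R1 = np.array([[5., 0.], [0., 3.], [4., 0.]])
R2 = np.array([[0.], [2.], [1.]])
A = build_adjacency([R1, R2])
```

    构造出的$\boldsymbol{A}$是对称的加权邻接矩阵,边权即用户对相应项的评分.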

    在该模型中,设计了一个MGCN模块,用于处理和提取$(D+1)$-部图中不同域之间潜在的高阶特征. 与普通的GCN网络相比,该模块摒弃了对特征聚合贡献较少的非线性激活部分,但保留了权重矩阵的训练过程. 该网络定义为

    $\boldsymbol{e}=f_{\mathrm{MGCN}}(\mathcal{G}^{D+1};\mathcal{W}^{\mathrm{g}})$, (2)

    其中$\boldsymbol{e}$为跨域协同滤波嵌入向量,$\mathcal{W}^{\mathrm{g}}$为MGCN模块的权重参数集合.

    MGCN的核心思想是利用公共用户作为桥梁,在不同域的项之间建立连接,然后通过线性的GCN来聚合跨域信息学习这些实体的嵌入向量. 该方法可促进信息的递归传递或特征的传播. 具体来说,其计算步骤为

    $\boldsymbol{E}^{(k+1)}=\hat{\boldsymbol{D}}^{-1/2}\hat{\boldsymbol{A}}\hat{\boldsymbol{D}}^{-1/2}\boldsymbol{E}^{k}\boldsymbol{W}^{k}$, (3)

    其中$\hat{\boldsymbol{A}}=\boldsymbol{A}+\boldsymbol{I}$是添加自连接的邻接矩阵,$\boldsymbol{I}$为单位矩阵,$\hat{\boldsymbol{D}}$为$\hat{\boldsymbol{A}}$的度矩阵,$\boldsymbol{W}^{k}$是第$k$层中的权值矩阵.
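    式(3)的单层线性图卷积可以示意如下(示意性实现,非论文官方代码;示例中的图、特征与权重均为随机假设值):

```python
import numpy as np

def mgcn_layer(A, E, W):
    """式(3)的一层线性图卷积: E^{k+1} = D^{-1/2} (A+I) D^{-1/2} E^k W^k.
    不含非线性激活, 对应文中摒弃激活函数的设计."""
    A_hat = A + np.eye(A.shape[0])           # 添加自连接
    d = A_hat.sum(axis=1)                    # 度向量(含自环, 恒为正)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^{-1/2}
    return D_inv_sqrt @ A_hat @ D_inv_sqrt @ E @ W

rng = np.random.default_rng(0)
A = np.array([[0., 1.], [1., 0.]])           # 2节点示例图
E = rng.standard_normal((2, 4))              # 初始特征矩阵 E^k
W = rng.standard_normal((4, 4))              # 第k层权值矩阵 W^k
E1 = mgcn_layer(A, E, W)
```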

    为了使用户-项的嵌入向量尽可能保留原始评分数据,特征矩阵$\boldsymbol{E}^{0}$表示为

    $\boldsymbol{E}^{0}=\begin{pmatrix}(\boldsymbol{R}^{1},…,\boldsymbol{R}^{D})^{\mathrm{T}} & \boldsymbol{I}\\ \boldsymbol{I} & (\boldsymbol{R}^{1},…,\boldsymbol{R}^{D})\end{pmatrix}$. (4)

    通过将特征矩阵$\boldsymbol{E}^{0}$输入到MGCN中,可以得到跨域嵌入向量矩阵$\boldsymbol{E}=\boldsymbol{E}^{k}$,其中$k$为MGCN的层数. 从跨域嵌入向量矩阵$\boldsymbol{E}$中,可以得到用户$u$的嵌入向量,记为$\boldsymbol{U}$;第$d$个域中项$v^{d}$的跨域嵌入向量,记为$\boldsymbol{V}^{d}$. 二者连接形成跨域协同滤波嵌入为

    $\boldsymbol{e}_{v^{d}}^{u}=[\boldsymbol{U}|\boldsymbol{V}^{d}]\in\mathbb{R}^{1\times 2N}$, (5)

    其中$N$为跨域协同滤波嵌入的维数.
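    式(5)取出用户与项的嵌入并拼接的操作可以示意如下(示意性实现;假设嵌入矩阵的行顺序为先项节点、后用户节点,与前文邻接矩阵的节点顺序一致):

```python
import numpy as np

def cross_domain_embedding(E, n_items, u, v_idx):
    """式(5): 从嵌入矩阵 E 中取出用户 u 的嵌入 U 与项的嵌入 V^d 并拼接.
    E 的前 n_items 行为项节点, 其后为用户节点; v_idx 为项在全部项节点中的全局下标."""
    U = E[n_items + u]                      # 用户嵌入, 维数 N
    V = E[v_idx]                            # 项嵌入, 维数 N
    return np.concatenate([U, V])           # 跨域协同滤波嵌入, 维数 2N

N = 4
E = np.arange(6 * N, dtype=float).reshape(6, N)   # 假设3个项节点 + 3个用户节点
e = cross_domain_embedding(E, n_items=3, u=0, v_idx=1)
```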

    MGCN利用其特征提取能力来处理$(D+1)$-部图,通过公共用户实现不同域之间的间接连接,更高效地在每个域内提取潜在的高阶结构嵌入向量. 由此产生的跨域协同滤波嵌入向量捕获了跨不同域的用户和项之间的底层关系和交互,从而提高了系统的推荐性能.

    在此使用跨域BLS来映射从MGCN模块获得的跨域嵌入向量,以减轻潜在的噪声. 传统的BLS由3个主要部分组成:映射特征层、特征增强层和输出层. 这3个部分共同作用,以增强模型的鲁棒性和预测能力. BLS网络定义为

    $\hat{\boldsymbol{r}}=f_{\mathrm{BLS}}(\boldsymbol{E};\mathcal{W}^{\mathrm{b}})$, (6)

    其中$\boldsymbol{E}\in\mathbb{R}^{|D|\times 2N}$为不同用户-项嵌入$\boldsymbol{e}_{v^{d}}^{u}$组合而成的矩阵,$\mathcal{W}^{\mathrm{b}}$为BLS模块的权重参数集合.

    在映射特征层中,对嵌入进行初步处理,将其随机映射到映射特征矩阵节点$\boldsymbol{Z}_{j}\in\mathbb{R}^{|D|\times d_{z}}$上,表示为

    $\boldsymbol{Z}_{j}=\phi_{j}(\boldsymbol{E}\boldsymbol{W}_{zj}+\boldsymbol{\beta}_{zj}),\; j=1,2,…,m$, (7)

    其中$|D|$为样本大小,$d_{z}$为每个映射特征组的维数,$m$为映射特征组的个数,$\phi_{j}$为第$j$个非线性映射特征映射函数. 在映射特征层中,采用了简单的线性变换函数. 与此同时,在上述过程中,$\boldsymbol{W}_{zj}\in\mathbb{R}^{2N\times d_{z}}$与$\boldsymbol{\beta}_{zj}\in\mathbb{R}^{|D|\times d_{z}}$在初始化过程中随机生成. 然后,将映射特征层中的节点输出组合成映射特征矩阵$\boldsymbol{Z}^{m}$,表示为

    $\boldsymbol{Z}^{m}=(\boldsymbol{Z}_{1}|\boldsymbol{Z}_{2}|…|\boldsymbol{Z}_{m})\in\mathbb{R}^{|D|\times md_{z}}$. (8)

    特征增强层以映射特征层的输出$\boldsymbol{Z}^{m}$作为输入,特征增强层节点$\boldsymbol{H}_{j}\in\mathbb{R}^{|D|\times d_{h}}$计算为

    $\boldsymbol{H}_{j}=\xi_{j}(\boldsymbol{Z}^{m}\boldsymbol{W}_{hj}+\boldsymbol{\beta}_{hj}),\; j=1,2,…,h$, (9)

    其中$d_{h}$表示每个特征增强组的维数,$h$表示特征增强组的个数,$\xi_{j}$为第$j$个非线性特征增强映射函数. 在上述过程中,采用ReLU作为非线性映射函数. $\boldsymbol{W}_{hj}\in\mathbb{R}^{md_{z}\times d_{h}}$与$\boldsymbol{\beta}_{hj}\in\mathbb{R}^{|D|\times d_{h}}$在初始化过程中随机生成. 然后,将特征增强层中的节点输出组合成特征增强矩阵$\boldsymbol{H}^{h}$,即

    $\boldsymbol{H}^{h}=(\boldsymbol{H}_{1}|\boldsymbol{H}_{2}|…|\boldsymbol{H}_{h})\in\mathbb{R}^{|D|\times hd_{h}}$. (10)

    在输出层中,BLS模块用映射特征矩阵$\boldsymbol{Z}^{m}$和增强特征矩阵$\boldsymbol{H}^{h}$计算输出$\hat{\boldsymbol{Y}}$,即

    $\hat{\boldsymbol{Y}}=(\boldsymbol{Z}^{m}|\boldsymbol{H}^{h})\boldsymbol{W}^{y}$, (11)

    其中$\boldsymbol{W}^{y}\in\mathbb{R}^{(md_{z}+hd_{h})\times d_{y}}$为可训练的权重矩阵,$d_{y}$为输出标签的数量. 在训练过程中,只需要调整该可训练矩阵,这可以通过使用岭回归算法得到一个伪逆矩阵来近似,即

    $\boldsymbol{W}^{y}=(\boldsymbol{Z}^{m}|\boldsymbol{H}^{h})^{+}\boldsymbol{Y}$. (12)
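    式(12)中用岭回归近似伪逆求解输出权重的过程可示意如下(示意性实现,非论文官方代码;正则化系数$\lambda$为假设值,$\lambda\to 0$时趋近于Moore-Penrose伪逆解):

```python
import numpy as np

def solve_output_weights(F, Y, lam=1e-2):
    """式(12): W^y = (Z^m|H^h)^+ Y 的岭回归近似,
    即 W^y = (F^T F + λI)^{-1} F^T Y, 其中 F = (Z^m|H^h) 为拼接后的隐层特征."""
    k = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(k), F.T @ Y)

rng = np.random.default_rng(0)
F = rng.standard_normal((50, 8))             # 50个样本的 (Z^m|H^h) 拼接特征(假设值)
W_true = rng.standard_normal((8, 1))
Y = F @ W_true                               # 构造可被线性拟合的标签矩阵
W_y = solve_output_weights(F, Y, lam=1e-6)
```

    该闭式解是BLS训练快的关键:隐层随机权重固定,只需一次线性求解即可得到$\boldsymbol{W}^{y}$.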

    在获得BLS模块输出层的输出矩阵$\hat{\boldsymbol{Y}}$后,便可以得到相应的用户$u$与项$v^{d}$的预测向量$\hat{\boldsymbol{y}}_{v^{d}}^{u}$. 同时,需要对其进行归一化和加权,计算预测的用户-项评分$\hat{\boldsymbol{r}}_{v^{d}}^{u}$,即

    $\hat{\boldsymbol{r}}_{v^{d}}^{u}=\displaystyle\sum_{j=1}^{d_{y}}\frac{\hat{\boldsymbol{y}}_{v^{d}}^{u}[j]-\min\left(\hat{\boldsymbol{y}}_{v^{d}}^{u}\right)}{\max\left(\hat{\boldsymbol{y}}_{v^{d}}^{u}\right)-\min\left(\hat{\boldsymbol{y}}_{v^{d}}^{u}\right)}\hat{\boldsymbol{y}}_{v^{d}}^{u}[j]$. (13)
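    式(13)的min-max归一化加权求和可示意如下(示意性实现;示例输入向量为假设值):

```python
import numpy as np

def predict_rating(y_hat):
    """式(13): 对 BLS 输出向量做 min-max 归一化, 以归一化值为权重对各分量加权求和,
    得到标量的预测评分."""
    w = (y_hat - y_hat.min()) / (y_hat.max() - y_hat.min())
    return float((w * y_hat).sum())

r = predict_rating(np.array([1.0, 2.0, 3.0]))   # 权重 w = [0, 0.5, 1]
```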

    使用传统的基于GCN的方法训练MGCN网络时,由于在图网络中聚合邻居节点信息可能引入不合理的交互信息,学习到的高阶信息可能不够准确和合理,导致模型预测精度较低、鲁棒性较差. 为克服这一问题,本文利用BLS的随机映射特性来增强MGCN模型的鲁棒性. 因此,GBCD方法不使用中间结果进行MGCN的训练,而是端到端地更新网络,直接利用最终推荐任务的输出(即BLS的输出)作为优化目标,即面向任务的训练优化方法. 在本文中,最终推荐任务的目标是预测评分,GBCD方法的损失函数表示为

    $\min\limits_{\mathcal{W}^{\mathrm{g}},\mathcal{W}^{\mathrm{b}}}\dfrac{1}{|\mathcal{R}|}\displaystyle\sum_{\boldsymbol{r}_{v^{d}}^{u}\in\mathcal{R}}\left(\boldsymbol{r}_{v^{d}}^{u}-\hat{\boldsymbol{r}}_{v^{d}}^{u}\right)^{2}$, (14)

    其中$\mathcal{R}$表示输入样本集合.
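    式(14)的面向任务的均方误差损失可示意如下(示意性实现;实际训练中该损失通过反向传播同时更新$\mathcal{W}^{\mathrm{g}}$与$\mathcal{W}^{\mathrm{b}}$,此处仅示意损失本身的计算):

```python
import numpy as np

def task_oriented_loss(r, r_hat):
    """式(14): 面向最终推荐任务的均方误差损失,
    直接以 BLS 输出的预测评分与真实评分之差为优化目标."""
    r = np.asarray(r, dtype=float)
    r_hat = np.asarray(r_hat, dtype=float)
    return float(((r - r_hat) ** 2).mean())

loss = task_oriented_loss([3.0, 4.0, 5.0], [3.0, 5.0, 5.0])
```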

    算法1提供了本文所提出的GBCD的伪代码.

    算法1. 图卷积宽度跨域推荐系统(GBCD).

    输入:$(D+1)$-部图$\mathcal{G}^{D+1}$,映射特征组的个数$m$,映射特征组的维数$d_{m}$,特征增强组的个数$h$,每个特征增强组的维数$d_{h}$,训练轮数$N_{\mathrm{epoch}}$.

    输出:用户-项的预测评分$\hat{\boldsymbol{r}}$.

    ① 初始化权重参数$\mathcal{W}^{\mathrm{g}}$和$\mathcal{W}^{\mathrm{b}}$;
    ② $\boldsymbol{A},\boldsymbol{E}^{0} \leftarrow \mathcal{G}^{D+1}$; /*将$(D+1)$-部图转换为相应的加权邻接矩阵和特征矩阵*/
    ③ for $N_{\mathrm{epoch}}$ do
    ④  $\boldsymbol{E} \leftarrow f_{\mathrm{MGCN}}(\boldsymbol{A},\boldsymbol{E}^{0};\mathcal{W}^{\mathrm{g}})$; /*通过MGCN提取和学习不同域之间的嵌入向量*/
    ⑤  for $j$ in 1 to $m$ do
    ⑥   $\boldsymbol{Z}_{j} \leftarrow \phi_{j}(\boldsymbol{E}\boldsymbol{W}_{zj}+\boldsymbol{\beta}_{zj})$; /*对嵌入向量进行随机映射生成BLS映射层输出*/
    ⑦  end for
    ⑧  $\boldsymbol{Z}^{m} \leftarrow (\boldsymbol{Z}_{1}|\boldsymbol{Z}_{2}|…|\boldsymbol{Z}_{m})$; /*映射特征层中的节点输出组合成映射特征矩阵*/
    ⑨  for $j$ in 1 to $h$ do
    ⑩   $\boldsymbol{H}_{j} \leftarrow \xi_{j}(\boldsymbol{Z}^{m}\boldsymbol{W}_{hj}+\boldsymbol{\beta}_{hj})$; /*对BLS映射层输出进行非线性变换生成BLS增强层输出*/
    ⑪  end for
    ⑫  $\boldsymbol{H}^{h} \leftarrow (\boldsymbol{H}_{1}|\boldsymbol{H}_{2}|…|\boldsymbol{H}_{h})$; /*特征增强层中的节点输出组合成特征增强矩阵*/
    ⑬  $\boldsymbol{W}^{y} \leftarrow (\boldsymbol{Z}^{m}|\boldsymbol{H}^{h})^{+}\boldsymbol{Y}$; /*通过岭回归算法得到近似的伪逆矩阵,结合标签信息矩阵$\boldsymbol{Y}$计算所需权重矩阵*/
    ⑭  $\hat{\boldsymbol{Y}} \leftarrow (\boldsymbol{Z}^{m}|\boldsymbol{H}^{h})\boldsymbol{W}^{y}$;
    ⑮  $\hat{\boldsymbol{r}} \leftarrow \hat{\boldsymbol{Y}}$;
    ⑯  计算并最小化式(14); /*通过最小化该损失函数学习GBCD*/
    ⑰ end for
    ⑱ 返回$\hat{\boldsymbol{r}}$.

    在本节实验中,我们打算回答3个研究问题:

    研究问题1. 为什么有必要使用CDR方法,以及利用来自源域的信息是否能提高其有效性. 此外,与其他最先进的跨域方法相比,我们提出的GBCD方法性能表现如何.

    研究问题2. 利用MGCN聚合多部图的特征是否有优势. 此外,结合BLS随机映射的特征是否增强了模型的鲁棒性.

    研究问题3. 超参数如何影响GBCD的性能.

    根据现有文献[2],本文实验使用2个具有多个项域的真实公共数据集,即Amazon数据集和MovieLens数据集,如表2所示.

    表  2  实验中使用的2个数据集
    Table  2.  Two Datasets Used in Experiments
    数据集 域 用户数 项数 评分数 密度/%
    Amazon Books 12761 7346 85400 0.09
    Amazon CDs 12761 2541 85865 0.27
    Amazon Music 12761 778 28680 0.29
    Amazon Movies 12761 8270 188507 0.18
    Amazon Beauty 30000 302782 345231 0.01
    Amazon Fashion 30000 146794 140648 0.01
    MovieLens COM 2113 3029 332038 5.19
    MovieLens DRA 2113 3975 381616 4.54
    MovieLens ACT 2113 1277 241211 8.94
    MovieLens THR 2113 1460 226975 7.36

    1)Amazon. 该数据集包含1996年5月至2018年10月的2.331亿条评论(评分),每条记录均为一个元组(用户、项、评分、时间戳). 由于数据集规模相当大,评分记录很少的用户倾向于对随机项进行评分,这会降低效率和有效性. 因此,在实验中按照文献[2]的设定,将书本(books)、光盘(CDs)、音乐(music)和电影(movies)这4个域中评分记录数小于5的用户和项删除. 同时,保留美容(beauty)和时尚(fashion)这2个域的原始数据集大小.

    2)MovieLens. 数据集来自马德里自治大学的信息检索组,该数据集包含2113名用户、10197部电影、855598个1970―2009年的电影评分. 我们使用电影的标签将电影划分为18个域,并在我们的实验中使用了4个电影域,即喜剧(COM)、戏剧(DRA)、动作(ACT)和惊悚(THR).

    表3所示,我们从这2个不同的数据集中定义了23个CDR任务.

    表  3  跨域任务的统计信息
    Table  3.  Statistics of the Cross Domain Tasks
    数据集 CDR任务 源域 目标域
    Amazon 1 Books CDs
    2 Books Music
    3 Books Movies
    4 CDs Music
    5 CDs Movies
    6 Music Movies
    7 Books+CDs Music
    8 Books+CDs Movies
    9 Books+Music Movies
    10 CDs+Music Movies
    11 Books+CDs+Music Movies
    12 Beauty Fashion
    MovieLens 1 COM DRA
    2 COM ACT
    3 COM THR
    4 DRA ACT
    5 DRA THR
    6 ACT THR
    7 COM+DRA ACT
    8 COM+DRA THR
    9 COM+ACT THR
    10 DRA+ACT THR
    11 COM+DRA+ACT THR

    由于GBCD属于CDR方法的类别,本文的重点是将其性能与经典的和最先进的CDR方法进行比较. 因此,我们选择6种方法作为对比算法:1)单域推荐模型(target,TGT)是一种经典的单域MF模型,仅使用目标域数据进行训练. 2)CMF[44]是MF的扩展,它考虑了目标域和源域的交互矩阵,在这2个域之间共享用户的嵌入内容. 3)L-GCN[43]是一个简化的图卷积推荐方法,它是一个单领域方法. 我们将多个域的数据进行合并,利用L-GCN在合并数据上进行推荐计算. 4)EMCDR[46]是一种常用的冷启动CDR方法. 它将用户偏好编码为源域和目标域中的向量,然后学习一个映射函数,将用户向量从源域映射到目标域. 5)PTUP[27]是一种个性化的桥接CDR方法,它通过学习由用户特征嵌入组成的元网络来定制用户桥接. PTUP提供了3种变体版本PTUP-MF,PTUP-DNN,PTUP-GMF,每种版本都使用不同的模型进行个性化桥接. 6)DisenCDR[47]通过解耦领域共享和领域特定信息以及利用互信息规则来增强跨域推荐性能.

    GBCD方法以及对比方法均基于PyTorch实现,其中TGT,CMF,EMCDR复用了PTUP公开代码中的实现. Adam优化器的初始学习速率在{0.001,0.001,0.005,0.01,0.02,0.02,0.1}范围内使用网格搜索进行调整. 另外,所有模型的批处理大小均设置为256,每个模型的嵌入维度为10.

    在GBCD方法中,我们将映射特征组数量$m$和特征增强组数量$h$均设置为25,映射特征维度$d_{m}$设置为10,增强特征维度$d_{h}$设置为15. 将测试用户设置为25%的重叠用户. 所有实验均在一台拥有英特尔酷睿i9-10900 CPU,GeForce RTX 3090的服务器上运行. GBCD的代码可以在https://github.com/BroadRS/GBCD下载.

    在本节中,我们将介绍实验结果,并深入讨论将GBCD方法应用于23个跨域任务的实验效果. 实验结果如表4和表5所示,其中MAE和RMSE分别表示平均绝对误差和均方根误差. 结果如下:首先,TGT是一种单域模型,只利用目标域的数据而忽略源域的数据,其性能并不令人满意. 相比之下,其他利用源域数据进行跨域推荐的方法始终优于单域的TGT. 因此,结合源域数据的方法被证明是缓解数据稀疏性和提高目标域推荐性能的有效途径. 其次,CMF,L-GCN将多个域的数据合并到一个域中并共享用户的嵌入,但在大多数任务中(特别是在Amazon数据集上),CMF,L-GCN的表现都差于CDR方法. 造成这种差异的原因是,CMF,L-GCN对来自不同域的数据一视同仁,从而忽略了潜在的域特定特征;而CDR方法通过采用特定方法将源域嵌入转化到目标特征空间,从而有效解决域迁移问题. L-GCN的性能优于CMF,这是因为与CMF相比,L-GCN考虑到了用户与项之间的高阶交互信息. 最后,值得注意的是,在大多数情况下,GBCD的性能始终优于最佳对比方法. 这是由于与L-GCN相比,GBCD通过$(D+1)$-部图来提取跨域特征信息,比将多个域的数据简单合并更加有效;与EMCDR,PTUP,DisenCDR相比,GBCD利用MGCN在$(D+1)$-部图上显式捕捉了不同域之间高阶的交互信息,进而提升了跨域推荐的性能. 这进一步证明了GBCD在跨域推荐方面的有效性.

    表  4  在Amazon数据集上的性能结果比较
    Table  4.  Comparison of Performance Results on Amazon Dataset
    任务 评估指标 方法 提升度/%
    TGT CMF L-GCN EMCDR PTUP-MF PTUP-DNN PTUP-GMF DisenCDR GBCD(本文)
    1 MAE 4.4126 2.0761 1.4284 2.9807 1.4095 0.8481 1.2199 1.1426 0.8077 4.76
    RMSE 5.1390 2.8938 1.5315 3.3968 1.9718 1.1655 1.7455 1.3226 1.0124 13.14
    2 MAE 4.4121 2.2234 1.4501 3.3254 1.3885 0.8179 1.1673 1.0112 0.7715 5.67
    RMSE 5.1441 3.0424 1.5729 3.6547 1.9173 1.1287 1.7027 1.1812 1.0033 11.11
    3 MAE 4.2753 1.9922 1.3649 3.1422 1.1931 0.8388 1.0567 0.9796 0.8243 1.73
    RMSE 4.9974 2.7055 1.4885 3.5375 1.6146 1.1178 1.4960 1.3086 1.0532 5.78
    4 MAE 4.4090 1.0745 1.3875 1.5591 1.0396 0.7881 0.9428 1.2650 0.7605 3.50
    RMSE 5.1440 1.6133 1.5045 1.9730 1.5209 1.0809 1.4601 1.3874 0.9743 9.86
    5 MAE 4.2662 1.1581 1.3549 1.1762 0.8572 0.7978 0.8468 1.0165 0.7959 0.24
    RMSE 4.9697 1.6362 1.4892 1.5842 1.1494 1.0540 1.1155 1.2779 1.0151 3.69
    6 MAE 4.2423 1.0408 1.4625 1.0026 0.8332 0.8162 0.8270 1.2322 0.8071 1.11
    RMSE 4.9304 1.4686 1.5985 1.3383 1.1076 1.0425 1.0904 1.3083 0.9816 5.84
    7 MAE 4.4705 0.9866 1.3899 1.5988 1.0335 0.7589 0.9412 1.0480 0.7494 1.25
    RMSE 5.2000 1.4476 1.5012 1.9597 1.4720 1.0515 1.3644 1.2216 0.9954 5.34
    8 MAE 4.3285 1.0726 1.3371 1.0938 0.8753 0.8245 0.8512 1.2172 0.8146 1.20
    RMSE 5.0042 1.5010 1.4651 1.4726 1.1611 1.0525 1.1244 1.3182 1.0201 3.08
    9 MAE 4.2627 1.0081 1.3632 0.9862 0.8467 0.7761 0.8210 1.1562 0.8035 −3.53
    RMSE 4.9398 1.4133 1.4943 1.3093 1.1210 1.0398 1.0867 1.3172 0.9995 3.88
    10 MAE 4.2112 0.9572 1.3493 0.9766 0.8548 0.7802 0.8222 1.3246 0.7789 0.16
    RMSE 4.8905 1.3365 1.4851 1.2701 1.1440 1.0318 1.0898 1.4514 0.9740 5.60
    11 MAE 4.4446 0.9610 1.3343 0.9786 0.8702 0.7821 0.8264 1.3118 0.7530 3.72
    RMSE 5.1121 1.3287 1.4571 1.2771 1.1578 1.0360 1.1043 1.4714 0.9586 7.47
    12 MAE 4.3761 4.1337 2.5227 3.9229 2.1195 2.0704 2.1029 3.2112 1.4552 29.71
    RMSE 5.2022 4.7734 2.9646 4.1703 2.6949 2.6547 2.6882 4.1867 1.9684 25.85
    注:加粗为最优结果,提升度=(最佳基线性能 - GBCD的性能)/最佳基线性能.
    表  5  在MovieLens数据集上的性能结果比较
    Table  5.  Comparison of Performance Results on MovieLens Dataset
    任务 评估指标 方法 提升度/%
    TGT CMF L-GCN EMCDR PTUP-MF PTUP-DNN PTUP-GMF DisenCDR GBCD(本文)
    1 MAE 3.5187 0.7259 0.7206 0.7093 0.7087 0.6912 0.6943 0.9793 0.6550 5.66
    RMSE 4.0828 0.9454 0.9892 0.9183 0.9203 0.9035 0.9066 1.0842 0.8523 5.99
    2 MAE 3.4289 0.7524 0.8395 0.7484 0.7461 0.7289 0.7343 0.8931 0.6793 7.49
    RMSE 4.0343 0.9715 1.0485 0.9637 0.9689 0.9377 0.9509 0.9520 0.8720 8.30
    3 MAE 3.6157 0.7597 0.8823 0.7402 0.7337 0.7203 0.7217 0.8597 0.6732 6.72
    RMSE 4.2084 0.9847 1.0985 0.9562 0.9481 0.9339 0.9412 0.9576 0.8657 8.02
    4 MAE 3.3605 0.7036 0.8390 0.7131 0.7086 0.6924 0.6903 0.9571 0.6834 1.00
    RMSE 3.9528 0.9122 1.0550 0.9178 0.9152 0.8945 0.8944 1.0122 0.8861 0.93
    5 MAE 3.5322 0.7205 0.8955 0.7380 0.7130 0.7056 0.7025 0.9522 0.6817 2.96
    RMSE 4.1614 0.9349 1.1055 0.9499 0.9291 0.9158 0.9157 1.0321 0.8754 4.40
    6 MAE 3.5005 0.7063 0.874 0.7170 0.7064 0.6971 0.6911 0.9160 0.6643 3.88
    RMSE 4.1662 0.9169 1.0900 0.9241 0.9168 0.9020 0.8939 0.9774 0.8584 3.97
    7 MAE 3.4866 0.7187 0.8145 0.7360 0.7153 0.7052 0.7002 0.9363 0.6985 0.86
    RMSE 4.0680 0.9342 1.0435 0.9473 0.9245 0.9251 0.9097 0.9871 0.9019 0.24
    8 MAE 3.4522 0.7260 0.8760 0.7480 0.7303 0.7133 0.7056 0.9279 0.6719 4.78
    RMSE 4.0307 0.9446 1.1065 0.9618 0.9391 0.9246 0.9206 0.9939 0.8758 4.87
    9 MAE 3.4793 0.7324 0.8765 0.7379 0.7282 0.7161 0.7111 0.9040 0.6793 4.47
    RMSE 4.1414 0.9467 1.1080 0.9467 0.9406 0.9231 0.9183 0.9585 0.8720 5.04
    10 MAE 3.5241 0.7147 0.8870 0.7395 0.7124 0.6995 0.6956 0.8730 0.6641 4.52
    RMSE 4.1439 0.9283 1.1070 0.9517 0.9283 0.9120 0.9078 0.9313 0.8580 5.49
    11 MAE 3.5005 0.7063 0.7057 0.7170 0.7125 0.7064 0.6995 0.9325 0.6701 4.20
    RMSE 4.1662 0.9169 1.1069 0.9241 0.9122 0.9168 0.8946 0.9870 0.8691 2.85
    注:加粗为最优结果,提升度= (最佳基线性能−GBCD的性能)/最佳基线性能.

    GBCD的时间复杂度主要来自2部分:MGCN模块和BLS模块. MGCN模块每层在进行图卷积的过程中,每个节点都需要与邻居节点进行信息交换;在每次信息交换中,都需要对$N$维特征向量进行操作,所以MGCN每层的时间复杂度为$O(|\mathcal{E}|N)$. BLS模块的映射特征层生成中需要进行归一化和矩阵运算等操作,时间复杂度约为$O(|D|Nmd_{m})$;增强特征层需要进行非线性激活和矩阵运算等操作,时间复杂度约为$O(|D|md_{m}hd_{h})$. 所以BLS模块的时间复杂度约为$O(|D|(Nmd_{m}+md_{m}hd_{h}))$. 各方法的训练时间与时间复杂度、早停策略、批量大小有关.

    表6表7给出GBCD与对比方法在23个跨域任务上的训练时间. 从表6表7中可以看出,基于图模型的训练时间在大部分跨域任务的耗时高于传统模型,如L-GCN和GBCD的训练耗时高于其他的对比方法. 这是由于基于图的模型需要处理图结构,导致计算复杂度增加,以及GCN所采用的全批量训练方法导致收敛较慢. 同时,由于Amazon数据集相比MovieLens数据集评分数较少,所以GBCD在MovieLens数据集上的大部分跨域任务比在Amazon数据集上的跨域任务更加耗时.

    表  6  在Amazon数据集上的训练时间(单位:s)
    Table  6.  Training Time on Amazon Dataset (Unit: s)
    任务 方法
    TGT CMF L-GCN EMCDR PTUP-MF PTUP-DNN PTUP-GMF DisenCDR GBCD (本文)
    1 53.64 79.95 645.80 11.07 125.85 126.6 123.66 502.80 485.86
    2 50.82 70.92 370.23 10.92 74.25 77.82 72.57 301.23 343.17
    3 97.05 117.78 844.94 17.07 137.37 149.79 142.65 962.10 809.02
    4 50.52 93.45 386.85 11.34 69.84 76.89 75.27 285.08 395.48
    5 98.01 140.61 444.71 16.56 141.30 152.82 147.57 837.20 667.92
    6 97.68 151.17 695.96 17.43 141.39 149.67 148.53 972.04 621.51
    7 52.35 119.58 640.57 11.04 69.72 77.82 73.20 235.24 618.98
    8 96.315 170.58 563.02 17.36 143.10 177.72 157.59 556.95 1048.18
    9 100.29 167.73 471.31 17.52 137.25 145.86 142.62 456.99 918.02
    10 99.12 200.52 470.87 16.92 134.64 147.72 142.44 4475.96 906.53
    11 117.45 275.55 1196.02 16.98 144.69 161.52 154.86 3722.65 1138.08
    12 238.32 1067.19 4229.78 24.07 175.64 173.42 179.11 5308.28 3902.21
    表  7  在MovieLens数据集上的训练时间(单位:s)
    Table  7.  Training Time on MovieLens Dataset (Unit: s)
    任务 方法
    TGT CMF L-GCN EMCDR PTUP-MF PTUP-DNN PTUP-GMF DisenCDR GBCD (本文)
    1 155.07 219.78 967.29 24.87 256.44 258.33 264.57 2381.35 1947.48
    2 216.75 292.02 907.18 31.53 352.05 347.73 347.22 1579.22 1597.79
    3 347.22 320.64 867.99 39.09 426.90 426.42 430.53 1525.28 1478.47
    4 219.15 387.15 872.87 31.89 341.13 350.64 362.67 1822.40 1719.06
    5 268.89 447.06 872.75 41.52 427.35 436.35 455.10 1522.30 1661.04
    6 402.18 724.77 774.60 49.62 438.03 433.20 707.58 1147.83 1223.79
    7 257.52 566.34 1212.95 35.10 356.19 392.37 408.51 1421.16 2633.87
    8 332.25 598.35 1186.57 42.51 451.23 461.94 476.58 1333.53 2593.37
    9 278.28 680.82 1055.66 40.35 457.86 474.75 466.12 1004.09 2158.97
    10 326.55 819.60 1111.12 43.08 529.47 464.76 468.04 2423.01 2358.87
    11 218.73 906.75 1508.44 40.89 579.32 523.79 518.25 4297.78 3610.48

    在本节中,我们将进行实验,分析GBCD的不同组成部分,并开发3种变体,以更好地验证其有效性. 其中,GCD是在GBCD的基础上去掉BLS模块的变体;GMCD采用MLP代替BLS模块;GATBCD采用图注意力网络(GAT)代替GCN. 在此使用MAE和RMSE指标评估了GBCD及其变体的性能,结果见表8和表9. 表8和表9中的结果清楚地表明,GBCD模型的性能优于GCD和GMCD这2个变体,提升幅度最高达22.41%. 这表明,通过使用BLS的随机映射功能可增强模型的鲁棒性,有助于提高模型的预测性能. 此外,表8和表9还显示,GBCD模型显著优于GATBCD,这表明与使用GAT相比,利用GCN聚合邻域特征更有利于提高模型的预测性能. 这些实验结果不仅验证了GBCD不同组件的有效性,还表明了使用BLS增强模型鲁棒性和利用GCN聚合图邻域特征对提高模型预测性能的重要性.

    表  8  在Amazon数据集上的消融实验
    Table  8.  Ablation Study on Amazon Dataset
    任务 评估指标 方法 提升度/%
    GCD GMCD GATBCD GBCD (本文)
    1 MAE 1.0304 0.8708 1.2419 0.8077 7.24
    RMSE 1.1403 1.2367 1.6498 1.0124 11.22
    2 MAE 1.0181 0.8317 1.2123 0.7715 7.24
    RMSE 1.1272 1.1987 1.6525 1.0033 11.00
    3 MAE 0.9201 0.8538 1.1032 0.8243 3.56
    RMSE 1.1974 1.1821 1.5033 1.0532 10.90
    4 MAE 0.8360 0.8200 1.0054 0.7605 7.26
    RMSE 1.1440 1.2187 1.3767 0.9743 14.83
    5 MAE 0.8508 0.8636 0.9839 0.7959 6.45
    RMSE 1.1697 1.2099 1.3809 1.0151 13.21
    6 MAE 0.9423 0.8823 1.0313 0.8071 8.52
    RMSE 1.2304 1.1869 1.2167 0.9816 17.30
    7 MAE 1.0258 0.8918 0.9943 0.7494 15.97
    RMSE 1.2500 1.2375 1.3066 0.9954 19.56
    8 MAE 0.9285 0.8952 1.0256 0.8146 9.00
    RMSE 1.1042 1.2128 1.2179 1.0201 7.62
    9 MAE 0.9237 0.9068 0.9482 0.8035 11.39
    RMSE 1.1732 1.2051 1.2390 0.9995 14.81
    10 MAE 0.9180 0.8829 0.9670 0.7789 11.78
    RMSE 1.2905 1.2429 1.2608 0.9740 21.63
    11 MAE 0.8412 0.8695 0.9520 0.7530 13.40
    RMSE 1.2553 1.2115 1.2604 0.9586 20.87
    12 MAE 1.6986 1.7228 2.1872 1.4552 14.33
    RMSE 2.1740 2.2176 2.3449 1.9684 9.45
    注:加粗为最优结果,提升度= (最佳基线性能−GBCD的性能)/最佳基线性能.
    表  9  在MovieLens数据集上的消融研究
    Table  9.  Ablation Study on MovieLens Dataset
    任务 评估指标 方法 提升度/%
    GCD GMCD GATBCD GBCD (本文)
    1 MAE 0.9017 0.7796 1.0419 0.6550 15.98
    RMSE 1.3749 0.9827 1.2809 0.8523 13.27
    2 MAE 0.8559 0.8223 1.0040 0.6793 17.39
    RMSE 1.1631 1.0625 1.1608 0.8720 17.93
    3 MAE 0.8900 0.8091 0.8969 0.6732 16.80
    RMSE 1.1758 1.0174 1.1699 0.8657 14.91
    4 MAE 0.8503 0.8383 0.8799 0.6834 18.48
    RMSE 1.1044 1.0598 1.1554 0.8861 16.39
    5 MAE 0.8305 0.8121 1.1041 0.6817 16.06
    RMSE 1.0946 1.0433 1.3024 0.8754 16.09
    6 MAE 0.8354 0.7888 0.8375 0.6643 15.78
    RMSE 1.0966 0.9977 1.0393 0.8584 13.96
    7 MAE 0.9274 0.8845 0.9139 0.6985 21.03
    RMSE 1.2729 1.1225 1.2139 0.9019 19.65
    8 MAE 0.8365 0.8173 0.8518 0.6719 17.79
    RMSE 1.1261 1.0384 1.2059 0.8758 15.66
    9 MAE 0.8105 0.8008 0.8855 0.6793 15.17
    RMSE 1.0747 1.0103 1.0972 0.8720 13.69
    10 MAE 0.8734 0.8559 0.8971 0.6641 22.41
    RMSE 1.1123 1.0838 1.1897 0.8580 20.83
    11 MAE 0.8403 0.8063 0.9081 0.6701 16.89
    RMSE 1.0406 1.0169 1.1133 0.8691 14.53
    注:加粗为最优结果,提升度= (最佳基线性能 - GBCD的性能)/最佳基线性能.
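上述消融实验使用的MAE、RMSE以及"提升度"的计算方式可用如下Python代码示意. 其中示例数值取自表9任务10的MAE列;代码为本文示意,并非原论文实现:

```python
import math

def mae(y_true, y_pred):
    # 平均绝对误差
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    # 均方根误差
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def improvement(best_baseline, gbcd):
    # 提升度 = (最佳基线性能 - GBCD的性能) / 最佳基线性能,单位为%
    return (best_baseline - gbcd) / best_baseline * 100

# 表9任务10的MAE:最佳基线(GMCD)为0.8559,GBCD为0.6641
print(round(improvement(0.8559, 0.6641), 2))  # 约为22.41,与表中一致
```

注意MAE和RMSE均为误差指标,数值越小越好,因此提升度取"基线减GBCD"为正向改进.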

在本节中,我们对GBCD方法的超参数,即映射特征组数m、映射特征维数d_m、特征增强组数h和特征增强维数d_h,进行敏感性分析. 在分析某一参数时,其他参数均保持固定. 通过这种敏感性分析,我们旨在研究每个超参数对GBCD方法性能的影响. 限于篇幅,仅给出2个数据集上任务1,6,7,11的结果,其他任务上的结果可类似得到.

首先分析GBCD方法中特征映射模块的超参数,包括映射特征组数m和映射特征维数d_m. 为分析GBCD方法对这2个超参数的敏感性,在{15,20,25,30,35}范围内测试m,在{5,10,15,20,25}范围内测试d_m,并使用MAE和RMSE评估不同取值对性能的影响,结果如图3和图4所示. 从图3和图4中可以看出,GBCD方法对特征映射模块中超参数的敏感度相对较低. 这表明在测试范围内,GBCD方法的性能不易受到m和d_m的特定取值的影响.

    图  3  GBCD在不同映射特征组数m下得到的RMSE和MAE值分析
    Figure  3.  Analysis of RMSE and MAE values obtained by GBCD with different numbers of mapped feature groups m
    图  4  GBCD在不同映射特征维数d_m下得到的RMSE和MAE值分析
    Figure  4.  Analysis of RMSE and MAE values obtained by GBCD with different numbers of mapped feature dimensions d_m

接下来分析GBCD方法特征增强模块中的超参数,包括特征增强组数h和特征增强维数d_h. 为分析GBCD方法对这2个超参数的敏感性,在{15,20,25,30,35}范围内测试h,在{5,10,15,20,25}范围内测试d_h,同样使用MAE和RMSE评估不同取值对性能的影响,结果如图5和图6所示. 从图5和图6中可以看出,GBCD方法对特征增强模块中超参数的敏感度相对较低. 这表明在测试范围内,GBCD方法的性能不易受到h和d_h的特定取值的影响.

    图  5  GBCD在不同特征增强组数h下得到的RMSE和MAE值分析
    Figure  5.  Analysis of RMSE and MAE values obtained by GBCD with different numbers of enhanced feature groups h
    图  6  GBCD在不同特征增强维数d_h下得到的RMSE和MAE值分析
    Figure  6.  Analysis of RMSE and MAE values obtained by GBCD with different numbers of enhanced feature dimensions d_h
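上述单变量敏感性分析的流程可用如下Python代码示意. 其中evaluate为假设的评估函数(实际中应在给定超参数配置下训练GBCD并返回验证集MAE),默认值与取值范围仅为示意:

```python
import random

def sensitivity(evaluate, defaults, grids):
    # 单变量敏感性分析:每次只改变1个超参数,其余固定为默认值
    results = {}
    for name, values in grids.items():
        maes = []
        for v in values:
            params = dict(defaults)
            params[name] = v
            maes.append(evaluate(params))
        results[name] = maes
    return results

def evaluate(params):
    # 假设的评估函数:此处仅返回带小幅扰动的常数,用于演示流程
    random.seed(sum(params.values()))
    return 0.80 + random.uniform(-0.02, 0.02)

defaults = {"m": 25, "d_m": 15, "h": 25, "d_h": 15}
grids = {"m": [15, 20, 25, 30, 35], "d_m": [5, 10, 15, 20, 25],
         "h": [15, 20, 25, 30, 35], "d_h": [5, 10, 15, 20, 25]}
res = sensitivity(evaluate, defaults, grids)
```

若某一超参数对应的MAE序列波动很小,即可认为模型对该超参数不敏感,这正是图3~图6所展示的结论.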

我们进行了案例分析以验证GBCD的有效性. 表10列出了Amazon和MovieLens数据集中部分跨域任务的具体案例,其中"真实评分"表示原始的真实评分,"预测评分"表示GBCD和部分基线的预测评分. 如表10所示,GBCD在大部分具体案例上表现更好. 此外,在大多数具体案例中,跨域推荐基线的表现优于单域基线,这说明结合源域数据是缓解数据稀疏性、提高目标域推荐性能的有效方法.

    表  10  GBCD有效性的案例研究
    Table  10.  Case Study of the Effectiveness of GBCD
    数据集 任务 用户编号 项编号 真实评分 不同算法的预测评分
    TGT L-GCN EMCDR PTUP-DNN DisenCDR GBCD (本文)
    Amazon 1 360 8874 1 2.8748 3.3818 1.5408 1.8930 1.6353 1.3503
    6 1569 5363 5 2.2330 3.1874 3.1622 4.0144 3.6050 4.4975
    7 4067 15249 5 1.1260 4.0448 3.7357 4.6258 4.7390 4.6835
    11 1188 27555 5 1.7055 3.4451 4.5359 4.7281 3.5772 4.7434
    MovieLens 1 414 4042 4 0.7557 4.4624 2.2757 4.1191 3.3242 4.4395
    6 814 4220 3 1.1037 3.2118 2.0289 3.6359 3.1161 3.0989
    7 1703 7248 5 0.1681 3.8297 3.0117 4.2053 3.4718 4.0739
    11 896 12777 3 1.9577 2.7000 2.1512 2.8680 3.9828 3.0385
    注:加粗为最优结果.

在本节中,我们进一步分析GBCD是否通过(D + 1)-部图提取到高阶的交互信息,从而学习到更好的用户和项表征以提升推荐性能. 为此,我们从MovieLens数据集的COM到DRA跨域任务中随机选择3名用户及相关项,并在图7中对比GBCD与PTUP-DNN提取的用户和项跨域表征. 联合分析图7(a)(b)(如用户2102和2105)可以发现,GBCD提取的用户历史项的跨域嵌入往往更加接近,该现象验证了GBCD相比传统的跨域推荐算法(如PTUP)能够捕捉到更复杂的用户-项高阶交互信息.

    图  7  从 GBCD和PTUP-DNN中学习得到的用户和项特征经过t-SNE转换可视化
    每个星形形状代表 MovieLens 数据集DRA域中的某个具体编号的用户,相同形状的圈代表对应用户对应的交互项.
    Figure  7.  User and item features learned in GBCD and PTUP-DNN visualised by t-SNE transformation

在本文中,我们建立了一个基于图卷积和宽度学习的跨域推荐系统(GBCD). 该方法引入多项关键创新以提高模型的性能和鲁棒性. 首先,将多域用户-项交互图建模为(D + 1)-部图,从而能够探索更高阶的特征. 其次,利用图卷积网络(GCN)学习这些高阶特征,捕捉跨域用户与项之间的复杂关系. 最后,采用BLS增强模型的鲁棒性,提高其预测能力. 此外,我们还提出了一种新的面向任务的优化损失函数,以有效优化GBCD方法. 在2个真实数据集上进行的大量实验表明,GBCD优于包括单域和跨域方法在内的对比方法,验证了GBCD在应对跨域推荐任务挑战方面的优越性.

在未来的工作中,我们将尝试纳入语义信息,如用户的社交信息和项知识图谱. 通过利用这些附加信息,可以提取更丰富的特征,并更准确地建模用户的细粒度偏好. 这种扩展可以进一步提升GBCD模型的性能和个性化能力,为用户提供更全面、更有针对性的推荐. 此外,已有一些研究尝试用预训练模型缓解数据稀疏性问题[54],这为解决该问题提供了另一种思路,我们未来也会尝试用预训练语言模型解决数据稀疏性问题.

    作者贡献声明:黄玲、王昌栋提出了算法思路和实验方案;黄镇伟、黄梓源负责完成实验并撰写论文;关灿荣负责实验数据采集和预处理;高月芳、王昌栋提出指导意见并修改论文.

  • 图  1   加速后的设计空间探索

    Figure  1.   Design space exploration after acceleration

    图  2   统计采样模拟

    Figure  2.   Statistical sample simulation

    图  3   综合模拟的流程(SFG[69])

    Figure  3.   Synthetic simulation flow (SFG[69])

    图  4   区间分析以破坏性缺失事件划分的区间为单位分析性能[104]

    Figure  4.   Interval analysis analyzes performance on an interval basis determined by disruptive miss events[104]

    图  5   基于图的处理器性能模型[180]

    Figure  5.   A graph-based processor performance model[180]

    表  1   处理器微架构设计空间探索的加速方法分类

    Table  1   Category of Acceleration Methods for Processor Microarchitecture Design Space Exploration

    类型 子类型 典型方法
    负载选择 基于微架构相关特征的方法 文献[15, 27]
    基于微架构无关特征的方法 MinneSPEC[28]、文献[29-32]、BenchSubset[33]、CASH[34]
    基于微架构相关与无关特征的方法 文献[29,35-37]、BenchPrime[38]
    部分模拟 统计采样模拟 采样单线程[39-43]、采样多线程[44-48]、采样访存[49-51]
    综合模拟 综合单线程[52-55]、综合多线程[56-58]、综合访存[59-62]
    设计点选择 采样方法 基于参数敏感度的方法[15,63-67]、基于实验设计的方法[6,25,34,67]
    迭代搜索方法 启发式方法[68-70]、组合优化方法[68,71-74]、统计推理方法[14,25-26,67,75]
    模拟工具 软件模拟 SimpleScalar[76],SESC[77],gem5[78]
    硬件模拟 FAST[79], PROTOFLEX[80-81], RAMP Gold[82], HAsim[83], FireSim[84]
    敏捷开发 基于低级语言的平台[85-87]、基于高级语言的平台[50,88-89]
    性能模型 特定负载预测模型 参数化模型[2,4,90]、核函数模型[13,68,91]、神经网络模型[3,92-93]、树模型[94-96]、集成学习模型[67,97-98]
    跨负载预测模型 基于负载特征[8,99-100]、基于硬件响应[9,23,101]、基于迁移学习[7,10-11]
    机械模型 分析模型[102-103]、区间模型[104-106]、图模型[107-109]、概率统计模型[110-112]、混合模型[113-115]

    表  2   加速方法对比

    Table  2   Comparison of Acceleration Methods

    类型 典型方法 加速比 准确率/%
    负载选择 文献[37] 5.2 93.0
    部分模拟 文献[116] 520 94.9
    设计点选择 文献[117] 23000 99.0
    模拟工具 文献[76] 1000 95.0
    性能模型 文献[118] 180000 98.2

    表  3   负载选择方法的对比

    Table  3   Comparison of Workload Selecting Methods

    方法类型 方法来源 使用微架构相关特征的方式 使用微架构无关特征的方式 聚类算法 负载选择比 误差/%
    基于微架构相关特征的方法
    文献[15] 参数显著性排名 阈值聚类 7/12 -
    文献[27] 执行时间向量 层次聚类 6/11 5
    基于微架构无关特征的方法
    文献[28] 卡方检验 -/23 -
    文献[29] 主成分分析 层次聚类 7/79 -
    文献[30] 基本块向量 距离最大 60/20 000 -
    文献[31] 主成分分析 k均值聚类 9/21 15
    文献[32] 基本块向量+主成分分析 层次聚类 4/47 -
    文献[33] 分组主成分分析 共识聚类 - -
    文献[34] 独立成分分析 多种聚类 5/27 3
    文献[120] 主成分分析/遗传算法 k质心聚类 50/118 5
    文献[121-122] 凸壳体积、主成分分析 遗传算法 6/22 -
    基于微架构相关与无关特征的方法
    文献[29,35] 主成分分析 层次聚类 14/29 -
    文献[36] 主成分分析 层次聚类 10/23 -
    文献[37] 主成分分析 层次聚类 12/43 7
    文献[123] 多元因素分析 层次聚类 10/23 -
    文献[38] 主成分分析+线性判别 多种聚类 20/54 -
    注:“负载选择比”列中的“/”表示选择的负载数量和全部负载数量之比,“-”表示文献中无数据. “✘”表示无该项.
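表3中"特征标准化、聚类、选取代表性负载"的基本流程可用如下Python代码示意. 这里用简化的k均值代替文中的层次聚类等算法,初始中心取前k个点以保证确定性;负载数据与参数均为假设:

```python
import math

def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))

def standardize(X):
    # 对每个特征做z-score标准化,消除量纲差异
    cols = list(zip(*X))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]

def kmeans(X, k, iters=20):
    # 简化的k均值:以前k个点作为初始聚类中心(确定性,仅为示意)
    centers = [list(x) for x in X[:k]]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in X:
            clusters[min(range(k), key=lambda j: dist(x, centers[j]))].append(x)
        centers = [[sum(col) / len(col) for col in zip(*cl)] if cl else centers[i]
                   for i, cl in enumerate(clusters)]
    return centers

def select_representatives(X, names, k):
    # 每个簇中选择离聚类中心最近的负载作为该簇的代表
    Z = standardize(X)
    centers = kmeans(Z, k)
    return [names[min(range(len(Z)), key=lambda j: dist(Z[j], c))] for c in centers]

# 假设4个负载、2维微架构无关特征(如指令混合比例、重用距离统计)
X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
reps = select_representatives(X, ["w1", "w2", "w3", "w4"], k=2)
```

选出的代表性负载子集可替代完整基准套件参与后续模拟,从而按表3中的"负载选择比"降低模拟开销.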

    表  4   常用基准套件汇总

    Table  4   Summary of Common Benchmark Suites

    类型 工作负载 简称
    多媒体和通信 MediaBench[124] MediaBench
    嵌入式 MiBench[125] MiBench
    单线程 SPEC CPU 2000[126] SPEC2k
    单线程 SPEC CPU 2006[127] SPEC2k6
    单/多线程 SPEC CPU 2017[18] SPEC2k17
    多线程 Princeton Application Repository for Shared-Memory Computers[128] PARSEC
    多线程 Stanford Parallel Applications for Shared Memory[129] SPLASH

    表  5   微架构相关特征

    Table  5   Microarchitecture-Dependent Features

    类型 特征
    整体聚合 执行时间、CPI、功率
    控制流 分支预测MPKI、BTB命中率
    cache行为(Icache/Dcache/L2/L3) 访问数量、命中数量、MPKI
    TLB行为(ITLB/DTLB/L2TLB) 访问数量、命中数量、MPKI
    注:MPKI表示每千条指令缺失.

    表  6   微架构无关特征

    Table  6   Microarchitecture-Independent Features

    类型 子类型 特征
    指令流 指令混合 整型、浮点、SIMD等
    控制分支
    存储读/写
    寄存器通信 平均操作数数量
    平均使用次数
    重用距离
    指令级并行性 不同窗口大小的并行度
    基本块大小
    指令局部性 指令工作集大小
    时间、空间重用距离
    数据流 数据局部性 数据工作集大小
    时间、空间重用距离
    通信特征 私有数据读写次数
    生产者写/消费者读次数
    注:SIMD表示单指令多数据流.

    表  7   部分模拟加速方法的对比

    Table  7   Comparison of Partial Simulation Acceleration Methods

    类型 目标 子类型 方法来源 指令流 数据流 微架构相关特征 加速比 误差/%
    统计采样模拟 采样单线程 随机采样 文献[134] - 7~17
    均匀采样 文献[39] 35~60 0.6
    文献[40] ~4 000 3.5
    代表性采样 文献[41-42,135-136] 62~107 3.7
    文献[43] 1~1.4 0.5
    文献[137] IPC, cache ~100 3
    文献[138] IPC, cache - 2~8
    采样多线程 基于时间 文献[44] IPC 10 5
    文献[139] IPC 5.8 3.5
    文献[45] IPC 20 5.3
    基于负载和特定同步 文献[47] 25 0.9
    文献[48] 220 0.5
    基于循环迭代 文献[46] 801 2.3
    采样访存 基于检查点 文献[49] cache, BP 8 000~15 000 ~0.6
    文献[51] cache, BP 50~100 ~0.6
    文献[50,140-141] cache, BP - -
    基于预热 文献[142] cache, BP 8 000~15 000 ~0.6
    文献[143] cache, BP ~100 1.5
    文献[144] cache, BP ~70 0.3
    文献[145-147] cache, BP - -
    综合模拟 综合单线程 文献[54] cache, BP - 5~7
    文献[148] cache, BP - 4.1
    文献[149] cache, BP - -
    文献[52-53] cache, BP - 8
    文献[150] cache, BP ~1 000 6.6
    文献[55,151] cache, BP ~1 000 2.4
    文献[116] cache, BP 520 5.1
    文献[152] cache, BP - 3.2
    文献[153-155] - -
    综合多线程 文献[58] 9~385 3.8~9.8
    文献[156] 1 000~10 000 4.9
    文献[56-57] cache, BP 40~70 5.5
    文献[157] cache, BP 21 8
    综合访存 文献[59] - 0.4~3.1
    文献[60-61] - -
    文献[158-160] 31 4.8
    文献[161] 20 2.8
    文献[62] 20~50 4.2
    文献[162] - 9
    注:“-”表示文献无该数据. “✔”表示有使用该类数据,“✘”表示没有使用该类数据.
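统计采样模拟(如SMARTS一类工作)的核心思想是周期性地抽取少量指令片段做详细模拟,并用采样片段的均值估计整体性能. 下面给出一个与具体模拟器无关的Python示意,轨迹数据为人工构造:

```python
def sampled_mean(trace, period, width):
    # 系统采样:每隔period个元素取width个元素做"详细模拟",
    # 用采样片段的均值估计整条轨迹的均值
    picked = []
    i = 0
    while i < len(trace):
        picked.extend(trace[i:i + width])
        i += period
    return sum(picked) / len(picked)

# 人工构造一条周期波动的"每指令CPI"轨迹
trace = [1.0 if (i // 50) % 2 == 0 else 2.0 for i in range(100000)]
true_mean = sum(trace) / len(trace)
est = sampled_mean(trace, period=1000, width=100)
```

此例中只详细"模拟"了10%的指令(每1000条取100条),估计值仍与真实均值一致;实际系统中还需处理cache、分支预测器等微架构状态的预热问题,即表7中"基于检查点"与"基于预热"两类工作所解决的问题.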

    表  8   实验设计的对比

    Table  8   Comparison of Design of Experiments

    类型 实验设计 样本数
    随机 均匀随机[3,170] M
    基于参数级别 2级全阶乘[25] 2^N
    中心复合设计[25] 1+2N+2^N
    Box-Behnken[25] 1+2N
    PB设计[15] 2N
    正交设计[6-7] 相对固定
    拉丁超立方体[92,97] M
    均匀拉丁超立方体[34,171] M
    基于距离 智能采样[3] M
    最小的成对距离最大[169] M
    最大化距离矩阵的迹[14] M
    k均值[172-173] N
    注:M为所需的样本数,N为参数个数.
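以表8中的拉丁超立方体为例,其采样过程可用如下Python代码示意:每个参数维度被划分为n个等宽区间,每个区间恰好被采样一次,再对各维的区间顺序做随机置换. 其中参数边界仅为假设值:

```python
import random

def latin_hypercube(n, bounds, seed=0):
    # 拉丁超立方体采样:每个维度划分为n个等宽区间,
    # 每个区间恰好被采样一次,各维的区间顺序随机置换
    random.seed(seed)
    samples = [[0.0] * len(bounds) for _ in range(n)]
    for d, (lo, hi) in enumerate(bounds):
        perm = list(range(n))
        random.shuffle(perm)
        for i in range(n):
            u = (perm[i] + random.random()) / n  # 落在第perm[i]个区间内的随机点
            samples[i][d] = lo + u * (hi - lo)
    return samples

# 假设2个设计参数:发射宽度相关参数∈[1,64],cache容量相关参数∈[16,512]
pts = latin_hypercube(8, [(1, 64), (16, 512)])
```

与均匀随机采样相比,这种分层方式以同样的样本数M保证了每个参数在其取值范围内的均匀覆盖,因此常被用于构建性能模型的训练集.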

    表  9   迭代搜索加速方法的对比

    Table  9   Comparison of Iterative Searching Acceleration Methods

    类型 子类型 方法来源 代理模型 搜索/获取函数 硬件设计空间
    启发式 文献[174] - 参数聚类、贪心 单核片上系统
    文献[64] - 敏感度、贪心 cache微架构
    文献[66] - 敏感度、贪心 FPGA软核
    文献[175] - 二进制搜索树 VLIW
    文献[176] - 贪心、单目标化 CMP
    组合优化 遗传算法 文献[71] - GA 单核片上系统
    文献[72] 2层次模拟 局部搜索+GA 单核CPU
    文献[73,177] 模糊系统 GA VLIW
    文献[171] 多项式回归 GA 单核CPU
    文献[117] - 爬山/GA/蚁群 CMP
    文献[74] ANN预测级别 NSGA-II CMP
    文献[69] ANN NSGA-II CMP
    文献[178] ANN NSGA-II VLIW
    文献[68] ACOSSO NSGA-II CMP
    模拟退火 文献[178] ANN预测级别 模拟退火 VLIW
    文献[179] 多种模型之一[25] 多种搜索算法 CMP
    统计推理 不确定度 文献[67,97] AdaBoost.ANN CoV 单核CPU
    文献[172-173] XGBoost 距离的最小值 单核CPU
    预期改善 文献[75,180] 克里金模型预测级别 EI(+GA) CMP
    文献[34] 随机森林 EI CMP
    超体积改善 文献[13] ACOSSO EHVI CMP
    文献[14] 高斯过程 EHVI 单核CPU
    文献[181] AdaGBRT HVI+均匀性 单核CPU
    文献[182] BagGBRT HVI+UCB 单核CPU
    帕累托 文献[25] 多种模型之一 候选帕累托最优解集 CMP
    文献[183-184] 马尔可夫决策 帕累托覆盖 CMP
    文献[26,168] 马尔可夫网预测分布 帕累托最优解集 CMP
    注:“-”表示该方法只以软件模拟或基于RTL的电路评估的方式获取性能指标,其余方法可通过训练代理模型替代软件模拟来获取指标或指标之间的关系.
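以表9中的模拟退火为例,下面给出一个在离散微架构参数空间上迭代搜索的Python示意. 其中设计空间的参数取值与代价函数均为假设,实际中代价应来自软件模拟或训练好的代理模型:

```python
import math
import random

def simulated_annealing(cost, space, iters=2000, t0=2.0, seed=1):
    # 模拟退火:以一定概率接受更差的邻居解以跳出局部最优,温度逐渐降低
    random.seed(seed)
    cur = [random.randrange(len(axis)) for axis in space]
    best = cur[:]
    for step in range(iters):
        t = t0 * (1 - step / iters) + 1e-9
        nxt = cur[:]
        d = random.randrange(len(space))
        nxt[d] = random.randrange(len(space[d]))  # 随机扰动1个参数
        delta = cost(nxt) - cost(cur)
        if delta < 0 or random.random() < math.exp(-delta / t):
            cur = nxt
            if cost(cur) < cost(best):
                best = cur[:]
    return [axis[i] for axis, i in zip(space, best)]

# 假设的设计空间:发射宽度、ROB大小、L2容量(KB)
space = [[1, 2, 4, 8], [32, 64, 128, 192], [256, 512, 1024, 2048]]

def cost(idx):
    width, rob, l2 = (axis[i] for axis, i in zip(space, idx))
    # 假设的代价函数:性能与面积的折中(并非真实模拟结果)
    return 100.0 / (width * math.log2(rob)) + 0.001 * (width * 8 + rob + l2 / 4)

best = simulated_annealing(cost, space)
```

与表9中基于代理模型和获取函数(EI,EHVI等)的统计推理方法相比,这类启发式搜索实现简单,但每一步都需要真实评估代价,因此通常需要更多的模拟次数.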

    表  10   模拟工具的对比

    Table  10   Comparison of Simulation Tools

    类型 准确率 模拟速度 灵活性 开发难度
    软件模拟 中(~10 MHz)
    硬件模拟 快(~100 MHz)
    敏捷设计 慢(1~5 kHz)

    表  11   硬件模拟平台的对比

    Table  11   Comparison of Hardware Simulation Platforms

    年份 平台 功能模拟 时序模拟 核心数 速率/(MIPS/核)
    2007 FAST[79] QEMU FPGA 1 1.20
    2009 ProtoFlex[80-81] FPGA 软件 16 -
    2010 RAMP Gold[82] FPGA FPGA 64 0.78
    2011 HAsim[83] FPGA FPGA 15 8.47
    2018 FireSim[84] FPGA FPGA 4096 3.42
    注:速率的单位为MIPS/核. “-”表示文献无该数据.

    表  12   敏捷开发平台的对比

    Table  12   Comparison of Agile Development Platforms

    语言类型 平台 设计语言 指令集 年份
    低级语言 OpenPiton[85] Verilog HDL SPARCv9 2016
    LiveHD[86] Verilog HDL RISC-V 2020
    BlackParrot [87] SystemVerilog RISC-V 2020
    高级语言 CMD[88] BlueSpec RISC-V 2018
    Agile[197] Chisel RISC-V 2016
    Chipyard[89] Chisel RISC-V 2020
    MINJIE[50] Chisel RISC-V 2022
    语言模型 llvm-mca[198] - - 2018
    Ithemal[199] - CISC 2019
    Chip-Chat[200] 自然语言 - 2023
    ChipGPT[201] 自然语言 RISC 2023
    RTLLM[202] 自然语言 RISC 2023
    注:“-”表示文献无该项.

    表  13   性能模型的对比

    Table  13   Comparison of Performance Models

    类型 准确性 复杂度 可解释性
    预测模型
    机械模型

    表  14   性能预测模型的对比

    Table  14   Comparison of Performance Prediction Models

    类型 准确性 复杂度 可解释性
    参数化
    核函数
    神经网络
    树模型
    集成学习

    表  15   特定负载预测模型的对比

    Table  15   Comparison of Workload-Specific Prediction Models

    类型 预测模型 硬件设计空间 预测指标 负载 误差/% R2 采样/设计空间
    参数化 线性回归[90] 单核 CPI MinneSPEC 0.8 - 200/67×10^6
    受限三次样条回归[2,4] 单核、异构核 CPI, E, P SPEC2k 4.9 - 4×10^3/22×10^9
    三次样条回归模型[5] 单核、多核 T 18项负载 1.4 - 300/4.3×10^9
    埃尔米特多项式插值[210] PHT, cache E SPEC2k, MediaBench - - 243/19×10^3
    核函数 支持向量机[170] 单核 T, E SPEC2k 0.5 - 12/4608
    内核典型相关分析[211] 多核 T, E ENePBench 6.2 0.88 450/2.8×10^6
    ACOSSO[68] 单核、多核 T, E, P SPEC2k, SPLASH-2 - - 450/128×10^3
    ACOSSO[13] 多核 T, E, P SPLASH-2 - - 100/332×10^3
    高斯过程[91] 核数 T SPLASH-3, PARSEC-3 - 0.82 67/68
    高斯过程[14] 单核 T, E, P 27项负载 - - 14/994
    神经网络 径向基函数网络[92] 单核 CPI MinneSPEC 2.8 - 200/512
    小波神经网络[93] 单核 CPI, E, P SPEC2k - - 1024/246×10^3
    神经网络[3,209,212-213] 单核、多核 CPI MinneSPEC等 2.3 - 221/23×10^3
    神经网络+遗传算法[214] 单核 CPI SPEC2k 3.3 - 230/23×10^3
    树模型 模型树[94] 性能计数器 CPI SPEC2k6 7.8 0.98 -
    模型树[95] 单核 T, E 图像压缩负载 1.3 0.95 3211/3288
    决策树[138] 性能计数器 CPI SPEC2k6,SysMark07等 2 - -
    决策树[96] 异构核 T, E SD-VBS, MiBench 2.1 - 664/830
    集成学习 自适应提升+神经网络[67,97] 单核 CPI SPEC2k6 - - 264/8.4×10^6
    梯度提升回归树[169] 单核、多核 T SPEC2k, SPLASH-2 1.1 - 3×10^3/15×10^6
    XGBoost[172] 单核 E riscv-tests 3.4 0.99 1120/1200
    提升法+梯度提升回归树[181] 单核 CPI, E, P SPEC2k17 - - 100/2×10^3
    装袋法+模型树[98] 单核 CPI, E SPEC2k - - 320/71×10^6
    装袋法+梯度提升回归树[182] 单核 CPI, E, P SPEC2k17 - - 100/37×10^3
    堆叠法+决策树[22] 单核、多核 T, E SPEC2k6,SPLASH-2 - - 100/605×10^3
    堆叠法+异类模型[118] 单核 CPI, E SPEC2k 1.8 - 3×10^3/2.5×10^9
    注:硬件设计空间中单核主要包括单核处理器微架构,多核指基于总线或片上网络的同构多核处理器. “T”指时间,“E”指功率,“P”指对多个性能指标探索帕累托最优解集,误差以CPI的百分比绝对误差衡量(越接近0越好),R2为相关系数(越接近1越好),“-”表示该工作无显式标注数据.
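表15中最简单的参数化模型可用一元线性回归示意:先对微架构参数做变换(如取对数),再用最小二乘拟合其与CPI的关系. 以下Python代码中的样本数据为人工构造,仅用于说明建模流程:

```python
import math

def fit_linear(xs, ys):
    # 最小二乘拟合 y = a*x + b(最简单的参数化性能模型示意)
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

# 假设样本:ROB大小取对数后与CPI近似线性(人工构造的示意数据)
rob = [32, 64, 128, 256]
xs = [math.log2(r) for r in rob]
ys = [1.30, 1.10, 0.90, 0.70]
a, b = fit_linear(xs, ys)
pred = a * math.log2(192) + b  # 预测未模拟配置(ROB=192)的CPI
```

由少量模拟样本拟合出模型后,即可对设计空间中其余配置直接预测而无需模拟;实际工作中通常使用表15所列的更复杂模型(样条回归、神经网络、集成学习等)并配合交叉验证评估误差.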

    表  16   跨负载预测模型工作的对比

    Table  16   Comparison of Cross-Workload Prediction Model Work

    类型 来源 预测模型 跨负载方法核心 性能指标 设计点数 误差/% R2
    负载特征 文献[99] 归一化、PCA+GA、线性回归 负载特征、平均相似负载的结果 时间 35×25+0 - -
    文献[8] 多项式回归+遗传算法 负载特征 CPI 360×7+0 8~10 >0.90
    文献[100] 模型树 负载特征 CPI、功率 500×25+0 - 0.90
    文献[34] 多种模型之一 负载特征、最近邻归类模型 CPI、功率 3 000×10+0 - 0.98
    硬件响应 文献[9] 神经网络 模型本身泛化 时间 639×27+50 - -
    文献[23] 矩阵补全算法 模型本身泛化 CPI、功率 128×20+20 10.0 -
    文献[101] 线性回归 响应边际关系、最近邻归类 CPI 60×23+600 6.3 0.92
    文献[215] 神经网络 响应签名、模型本身泛化 时间、EDP 1000×8+0 4.2 -
    迁移学习 文献[11] 神经网络 线性回归 CPI、功率 512×5+32 7.0 0.95
    文献[10,216] 神经网络 贪心选择负载、线性回归 CPI、功率 512×5+32 3.0 -
    文献[7] 模型树+自适应提升 负载聚类、样本迁移TrAdaBoost CPI 10×5+10 7.0 0.91
    文献[6] 神经网络+自适应提升 支持向量机 CPI 128×3+40 5.5 0.93
    注:“-”表示文献无该数据. “设计点数”列中的表达形式为源样本数量×源负载数量+目标样本数量.
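表16中基于迁移学习的一类做法(如文献[11]的线性回归映射)可示意为:以源负载上训练好的模型的预测值作为特征,再用少量目标负载样本拟合一个线性映射. 以下Python代码中的源模型与数据均为假设:

```python
def transfer_predict(source_pred, target_xs_few, target_ys_few, query_x):
    # 跨负载迁移(示意):假设目标负载的响应与源负载模型的预测近似线性相关,
    # 用少量目标负载样本拟合线性映射 y_target ≈ a * f_source(x) + b
    ps = [source_pred(x) for x in target_xs_few]
    n = len(ps)
    mp, my = sum(ps) / n, sum(target_ys_few) / n
    a = (sum((p - mp) * (y - my) for p, y in zip(ps, target_ys_few))
         / sum((p - mp) ** 2 for p in ps))
    b = my - a * mp
    return a * source_pred(query_x) + b

# 假设的源负载模型:CPI随cache容量(KB)增大而下降(并非真实数据)
source = lambda kb: 2.0 - 0.3 * (kb / 256)
# 目标负载的少量真实样本,人为构造为满足 y = 1.5*f(x) - 0.5
xs, ys = [64, 256, 512], [1.5 * source(k) - 0.5 for k in [64, 256, 512]]
pred = transfer_predict(source, xs, ys, 384)
```

这正对应表16"设计点数"列的含义:大量源样本用于训练源模型,目标负载上只需很少的新样本即可完成迁移.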

    表  17   跨负载预测模型的对比

    Table  17   Comparison of Cross-Workload Prediction Models

    类型 核心 准确性 复杂度
    负载特征 特征空间的相似性
    硬件响应 硬件响应作为新维度
    迁移学习 元模型的知识迁移

    表  18   机械模型工作的对比

    Table  18   Comparison of Mechanism Model Work

    类型 来源 目标架构 组件 预测指标 仅微架构无关特征 误差/% 速率/(MIPS/核)
    分析模型 文献[219] cache cache cache缺失 - -
    文献[102] 乱序、单核 指令窗口、BP、cache IPC 5.5 100
    文献[103] cache、多核 cache IPC 1.57 -
    文献[220] cache cache 功率、面积 5 -
    文献[221] 按/乱序、多核 BP、cache、NoC等 功率、面积 11~23 -
    区间模型 文献[222] 乱序、单核 BP, cache IPC 5.8 -
    文献[104] 乱序、单核 BP, cache IPC 7 -
    文献[223] 乱序、多核 BP, cache IPC 4.6 ~1
    文献[105,224] 按序、单核 指令依赖、BP, cache CPI、功率 2.5 ~6
    文献[225] 乱序、单核 BP, cache IPC、功率 9.3 1.9
    文献[226] 乱序、多核 BP, cache CPI 11.2 -
    文献[227] 乱序、单核 SIMD、cache、带宽 CPI、功率 25 -
    文献[228] 乱序、多核 SIMD、cache、带宽 时间、功率 36 -
    图模型 文献[107] 乱序、单核 BP, cache CPI - -
    文献[108] 乱序、单核 BP, cache CPI - -
    文献[109] 乱序、多核 BP, cache, NoC IPC 7.2 ~12
    概率统计模型 文献[110] 乱序、单核 BP, cache IPC 2~10 -
    文献[111] 乱序、多核 BP, cache IPC 7.9 ~9
    文献[112] cache cache cache缺失 0.2 -
    混合模型 文献[113] 乱序、单核 流水线深度 IPC - -
    文献[114] 乱序、单核 cache、MSHR、预取 CPI、cache缺失 9.4 ~15
    文献[115] 乱序、单核 执行单元 CPI 5.6 15.1
    注:“-”表示文献无该数据.

    表  19   机械模型的对比

    Table  19   Comparison of Mechanism Models

    类型 核心思想 准确性 复杂度
    分析模型 数学公式
    区间模型 事件分隔的区间
    图模型 依赖图的关键路径
    概率统计模型 事件发生的概率
    混合模型 分析模型+预测模型
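表19中分析模型/区间模型的核心可用一个极简的CPI分解公式示意:总CPI等于理想流水线的基础CPI加上各类缺失事件的平均代价之和. 以下Python代码中的事件率与代价均为假设值:

```python
def interval_cpi(base_cpi, events):
    # 区间模型思想的极简示意:总CPI = 基础CPI + Σ(事件率 × 事件代价)
    # events: {事件名: (每千条指令缺失数MPKI, 平均代价/周期)}
    return base_cpi + sum(mpki / 1000.0 * penalty
                          for mpki, penalty in events.values())

cpi = interval_cpi(0.5, {
    "分支预测错误": (5.0, 15),   # MPKI=5,每次错误约15周期(假设值)
    "L2缺失": (2.0, 40),
    "L3缺失": (0.5, 200),
})
```

真实的区间模型还需处理事件之间的重叠与乱序执行对缺失延迟的隐藏,这也是表18中各工作误差差异的主要来源之一.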
  • [1]

    Azizi O, Mahesri A, Lee B C, et al. Energy-performance tradeoffs in processor architecture and circuit design: A marginal cost analysis[C]//Proc of the 37th Annual Int Symp on Computer Architecture. New York: ACM, 2010: 26–36

    [2]

    Lee B C, Brooks D M. Illustrative design space studies with microarchitectural regression models[C]//Proc of the 13th Int Conf on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2007: 340–351

    [3]

    Ipek E, McKee S A, Caruana R, et al. Efficiently exploring architectural design spaces via predictive modeling[C]//Proc of the 12th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2006: 195–206

    [4]

    Lee B C, Brooks D M. Accurate and efficient regression modeling for microarchitectural performance and power prediction[C]//Proc of the 12th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2006: 185–194

    [5]

    Lee B C, Collins J D, Wang Hong, et al. CPR: Composable performance regression for scalable multiprocessor models[C]//Proc of the 41st Annual IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2008: 270–281

    [6]

    Li Dandan, Yao Shuzhen, Wang Senzhang, et al. Cross-program design space exploration by ensemble transfer learning[C]//Proc of the 36th IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2017: 201–208

    [7]

    Li Dandan, Wang Senzhang, Yao Shuzhen, et al. Efficient design space exploration by knowledge transfer[C]//Proc of the 11th IEEE/ACM/IFIP Int Conf on Hardware/Software Codesign and System Synthesis. New York: ACM, 2016: 12: 1−12: 10

    [8]

    Wu Weidan, Lee B C. Inferred models for dynamic and sparse hardware-software spaces[C]//Proc of the 45th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2012: 413–424

    [9]

    Wang Yu, Lee V, Wei G Y, et al. Predicting new workload or CPU performance by analyzing public datasets[J]. ACM Transactions on Architecture and Code Optimization, 2019, 15(4): 53: 1−53: 21

    [10]

    Dubach C, Jones T M, O’Boyle M F P. An empirical architecture-centric approach to microarchitectural design space exploration[J]. IEEE Transactions on Computers, 2011, 60(10): 1445−1458 doi: 10.1109/TC.2010.280

    [11]

    Dubach C, Jones T M, O’Boyle M F P. Microarchitectural design space exploration using an architecture-centric approach[C]//Proc of the 40th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2007: 262–271

    [12]

    Eeckhout L, De Bosschere K. Speeding up architectural simulations for high-performance processors[J]. Simulation, 2004, 80(9): 451−468 doi: 10.1177/0037549704044326

    [13]

    Wang Hongwei, Shi Jinglin, Zhu Ziyuan. An expected hypervolume improvement algorithm for architectural exploration of embedded processors[C]//Proc of the 53rd Annual Design Automation Conf. New York: ACM, 2016: 161: 1−161: 6

    [14]

    Bai Chen, Sun Qi, Zhai Jianwang, et al. BOOM-Explorer: RISC-V BOOM microarchitecture design space exploration framework[C/OL]//Proc of the 40th IEEE/ACM Int Conf on Computer Aided Design. Piscataway, NJ: IEEE, 2021[2023-12-17]. https://ieeexplore.ieee.org/document/9643455

    [15]

    Yi J J, Lilja D J, Hawkins D M. A statistically rigorous approach for improving simulation methodology[C]//Proc of the 9th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2003: 281–291

    [16]

    Monchiero M, Canal R, González A. Power/performance/thermal design-space exploration for multicore architectures[J]. IEEE Transactions on Parallel and Distributed Systems, 2008, 19(5): 666−681 doi: 10.1109/TPDS.2007.70756

    [17] 包云岗,常轶松,韩银和,等. 处理器芯片敏捷设计方法:问题与挑战[J]. 计算机研究与发展,2021,58(6):1131−1145 doi: 10.7544/issn1000-1239.2021.20210232

    Bao Yungang, Chang Yisong, Han Yinhe, et al. Agile design of processor chips: Issues and challenges[J]. Journal of Computer Research and Development, 2021, 58(6): 1131−1145 (in Chinese) doi: 10.7544/issn1000-1239.2021.20210232

    [18]

    Standard Performance Evaluation Corporation. SPEC CPU2017[EB/OL]. (2012-12-06)[2023-12-01]. https://www.spec.org/cpu2017

    [19]

    Yi J J, Lilja D J. Simulation of computer architectures: Simulators, benchmarks, methodologies, and recommendations[J]. IEEE Transactions on Computers, 2006, 55(3): 268−280 doi: 10.1109/TC.2006.44

    [20]

    Guo Qi, Chen Tianshi, Chen Yunji, et al. Accelerating architectural simulation via statistical techniques: A survey[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016, 35(3): 433−446 doi: 10.1109/TCAD.2015.2481796

    [21]

    O’Neal K, Brisk P. Predictive modeling for CPU, GPU, and FPGA performance and power consumption: A survey[C]//Proc of the 2018 IEEE Computer Society Annual Symp on VLSI. Los Alamitos, CA: IEEE Computer Society, 2018: 763–768

    [22]

    Chen Tianshi, Guo Qi, Tang Ke, et al. ArchRanker: A ranking approach to design space exploration[C]//Proc of the 41st Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2014: 85–96

    [23]

    Ding Yi, Mishra N, Hoffmann H. Generative and multi-phase learning for computer systems optimization[C]//Proc of the 46th Int Symp on Computer Architecture. New York: ACM, 2019: 39–52

    [24]

    Panerati J, Beltrame G. A comparative evaluation of multi-objective exploration algorithms for high-level design[J]. ACM Transactions on Design Automation of Electronic Systems, 2014, 19(2): 15: 1–15: 22

    [25]

    Palermo G, Silvano C, Zaccaria V. ReSPIR: A response surface-based pareto iterative refinement for application-specific design space exploration[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2009, 28(12): 1816−1829 doi: 10.1109/TCAD.2009.2028681

    [26]

    Mariani G, Palermo G, Zaccaria V, et al. DeSpErate++: An enhanced design space exploration framework using predictive simulation scheduling[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2015, 34(2): 293−306 doi: 10.1109/TCAD.2014.2379634

    [27]

    Cammarota R, Beni L A, Nicolau A, et al. Effective evaluation of multi-core based systems[C]//Proc of the 12th Int Symp on Parallel and Distributed Computing. Piscataway, NJ: IEEE, 2013: 19–25

    [28]

    KleinOsowski A J, Lilja D J. MinneSPEC: A new spec benchmark workload for simulation-based computer architecture research[J]. IEEE Computer Architecture Letters, 2002, 1(1): 7−10 doi: 10.1109/L-CA.2002.8

    [29]

    Eeckhout L, Vandierendonck H, De Bosschere K. Workload design: Selecting representative program-input pairs[C]//Proc of the 11th Int Conf on Parallel Architectures and Compilation Techniques. Los Alamitos, CA: IEEE Computer Society, 2002: 83–94

    [30]

    Breughe M, Eeckhout L. Selecting representative benchmark inputs for exploring microprocessor design spaces[J]. ACM Transactions on Architecture and Code Optimization, 2013, 10(4): 37: 1−37: 24

    [31]

    Joshi A, Phansalkar A, Eeckhout L, et al. Measuring benchmark similarity using inherent program characteristics[J]. IEEE Transactions on Computers, 2006, 55(6): 769−782 doi: 10.1109/TC.2006.85

    [32]

    Vandeputte F, Eeckhout L. Phase complexity surfaces: Characterizing time-varying program behavior[C]//Proc of the 3rd High Performance Embedded Architectures and Compilers. Berlin: Springer, 2008: 320–334

    [33]

    Zhan Hongping, Lin Weiwei, Mao Feiqiao, et al. BenchSubset: A framework for selecting benchmark subsets based on consensus clustering[J]. International Journal of Intelligent Systems, 2022, 37(8): 5248−5271 doi: 10.1002/int.22791

    [34]

    Sheidaeian H, Fatemi O. Toward a general framework for jointly processor-workload empirical modeling[J]. The Journal of Supercomputing, 2021, 77(6): 5319−5353 doi: 10.1007/s11227-020-03475-9

    [35]

    Phansalkar A, Joshi A, John L K. Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite[C]//Proc of the 34th Int Symp on Computer Architecture. New York: ACM, 2007: 412–423

    [36]

    Limaye A, Adegbija T. A workload characterization of the SPEC CPU2017 benchmark suite[C]//Proc of the 2018 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2018: 149–158

    [37]

    Panda R, Song Shuang, Dean J, et al. Wait of a decade: Did SPEC CPU 2017 broaden the performance horizon[C]//Proc of the 23rd IEEE Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2018: 271–282

    [38]

    Liu Qingrui, Wu Xiaolong, Kittinger L, et al. BenchPrime: Effective building of a hybrid benchmark suite[J]. ACM Transactions in Embedded Computing Systems, 2017, 16(5): 179: 1−179: 22

    [39]

    Wunderlich R E, Wenisch T F, Falsafi B, et al. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling[C]//Proc of the 30th Annual Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2003: 84–95

    [40]

    Hassani S, Southern G, Renau J. LiveSim: Going live with microarchitecture simulation[C]//Proc of the 22nd IEEE Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2016: 606–617

    [41]

    Hamerly G, Perelman E, Lau J, et al. SimPoint 3.0: Faster and more flexible program phase analysis[J/OL]. Journal of Instruction-Level Parallelism, 2005[2023-12-18]. http://www.jilp.org/vol7/v7paper14.pdf

    [42]

    Sherwood T, Perelman E, Hamerly G, et al. Discovering and exploiting program phases[J]. IEEE Micro, 2003, 23(6): 84−93 doi: 10.1109/MM.2003.1261391

    [43]

    Shen Xipeng, Zhong Yutao, Ding Chen. Locality phase prediction[C]//Proc of the 11th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2004: 165–176

    [44]

    Ardestani E K, Renau J. ESESC: A fast multicore simulator using time-based sampling[C]//Proc of the 19th Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2013: 448–459

    [45]

    Jiang Chuntao, Yu Zhibin, Jin Hai, et al. PCantorSim: Accelerating parallel architecture simulation through fractal-based sampling[J]. ACM Transactions on Architecture and Code Optimization, 2013, 10(4): 49: 1–49: 24

    [46]

    Sabu A, Patil H, Heirman W, et al. LoopPoint: Checkpoint-driven sampled simulation for multi-threaded applications[C]//Proc of the 28th Int Symp on High-Performance Computer Architecture. Piscataway, NJ: IEEE, 2022: 604–618

    [47]

    Carlson T E, Heirman W, Van Craeynest K, et al. BarrierPoint: Sampled simulation of multi-threaded applications[C]//Proc of the 2014 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2014: 2–12

    [48]

    Grass T, Carlson T E, Rico A, et al. Sampled simulation of task-based programs[J]. IEEE Transactions on Computers, 2019, 68(2): 255−269 doi: 10.1109/TC.2018.2860012

    [49]

    Wenisch T F, Wunderlich R E, Ferdman M, et al. SimFlex: Statistical sampling of computer system simulation[J]. IEEE Micro, 2006, 26(4): 18−31 doi: 10.1109/MM.2006.79

    [50]

    Xu Yinan, Yu Zihao, Tang Dan, et al. Towards developing high performance RISC-V processors using agile methodology[C]//Proc of the 55th IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2022: 1178–1199

    [51]

    Bryan P D, Rosier M C, Conte T M. Reverse state reconstruction for sampled microarchitectural simulation[C]//Proc of the 2007 IEEE Int Symp on Performance Analysis of Systems & Software. Los Alamitos, CA: IEEE Computer Society, 2007: 190–199

    [52]

    Nussbaum S, Smith J E. Modeling superscalar processors via statistical simulation[C]//Proc of the 10th Int Conf on Parallel Architectures and Compilation Techniques. Los Alamitos, CA: IEEE Computer Society, 2001: 15–24

    [53]

    Eeckhout L, Nussbaum S, Smith J E, et al. Statistical simulation: Adding efficiency to the computer designer’s toolbox[J]. IEEE Micro, 2003, 23(5): 26−38 doi: 10.1109/MM.2003.1240210

    [54]

    Oskin M, Chong F T, Farrens M. HLS: Combining statistical and symbolic simulation to guide microprocessor designs[C]//Proc of the 27th Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2000: 71–82

    [55]

    Bell R H, John L K. Improved automatic testcase synthesis for performance model validation[C]//Proc of the 19th Annual Int Conf on Supercomputing. New York: ACM, 2005: 111–120

    [56]

    Genbrugge D, Eeckhout L. Statistical simulation of chip multiprocessors running multi-program workloads[C]//Proc of the 25th Int Conf on Computer Design. Piscataway, NJ: IEEE, 2007: 464–471

    [57]

    Genbrugge D, Eeckhout L. Chip multiprocessor design space exploration through statistical simulation[J]. IEEE Transactions on Computers, 2009, 58(12): 1668−1681 doi: 10.1109/TC.2009.77

    [58]

    Hughes C, Li T. Accelerating multi-core processor design space evaluation using automatic multi-threaded workload synthesis[C]//Proc of the 4th Int Symp on Workload Characterization. Los Alamitos, CA: IEEE Computer Society, 2008: 163–172

    [59]

    Balakrishnan G, Solihin Y. WEST: Cloning data cache behavior using stochastic traces[C]//Proc of the 18th IEEE Int Symp on High-Performance Comp Architecture. Los Alamitos, CA: IEEE Computer Society, 2012: 1–12

    [60]

    Awad A, Solihin Y. STM: Cloning the spatial and temporal memory access behavior[C]//Proc of the 20th IEEE Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2014: 237–247

    [61]

    Wang Yipeng, Awad A, Solihin Y. Clone morphing: Creating new workload behavior from existing applications[C]//Proc of the 2017 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2017: 97–108

    [62]

    Wang Yipeng, Balakrishnan G, Solihin Y. MeToo: Stochastic modeling of memory traffic timing behavior[C]//Proc of the 24th Int Conf on Parallel Architecture and Compilation. Los Alamitos, CA: IEEE Computer Society, 2015: 457–467

    [63]

    Hekstra G J, La Hei G D, Bingley P, et al. TriMedia CPU64 design space exploration[C]//Proc of the 17th IEEE Int Conf on Computer Design: VLSI in Computers and Processors. Los Alamitos, CA: IEEE Computer Society, 1999: 599–606

    [64]

    Fornaciari W, Sciuto D, Silvano C, et al. A design framework to efficiently explore energy-delay tradeoffs[C]//Proc of the 9th Int Symp on Hardware/Software Codesign. New York: ACM, 2001: 260–265

    [65]

    Fornaciari W, Sciuto D, Silvano C, et al. A sensitivity-based design space exploration methodology for embedded systems[J]. Design Automation for Embedded Systems, 2002, 7(1): 7−33

    [66]

    Sheldon D, Kumar R, Lysecky R, et al. Application-specific customization of parameterized FPGA soft-core processors[C]//Proc of the 25th IEEE/ACM Int Conf on Computer-Aided Design. New York: ACM, 2006: 261–268

    [67]

    Li Dandan, Yao Shuzhen, Liu Yuhang, et al. Efficient design space exploration via statistical sampling and AdaBoost learning[C]//Proc of the 53rd Annual Design Automation Conf. New York: ACM, 2016: 142: 1−142: 6

    [68]

    Wang Hongwei, Zhu Ziyuan, Shi Jinglin, et al. An accurate acosso metamodeling technique for processor architecture design space exploration[C]//Proc of the 20th Asia and South Pacific Design Automation Conf. Piscataway, NJ: IEEE, 2015: 689–694

    [69]

    Mariani G, Palermo G, Zaccaria V, et al. Design-space exploration and runtime resource management for multicores[J]. ACM Transactions on Embedded Computing Systems, 2013, 13(2): 20: 1−20: 27

    [70]

    Jahr R, Calborean H, Vintan L, et al. Boosting design space explorations with existing or automatically learned knowledge[C]//Proc of the 15th Measurement, Modelling, and Evaluation of Computing Systems and Dependability and Fault Tolerance. Berlin: Springer, 2012: 221–235

    [71]

    Palesi M, Givargis T. Multi-objective design space exploration using genetic algorithms[C]//Proc of the 10th Int Symp on Hardware/Software Codesign. New York: ACM, 2002: 67–72

    [72]

    Eyerman S, Eeckhout L, De Bosschere K. Efficient design space exploration of high performance embedded out-of-order processors[C]//Proc of the 9th Design, Automation & Test in Europe Conf and Exhibition. Piscataway, NJ: IEEE, 2006: 351−356

    [73]

    Ascia G, Catania V, Di Nuovo A G, et al. Efficient design space exploration for application specific systems-on-a-chip[J]. Journal of Systems Architecture, 2007, 53(10): 733−750 doi: 10.1016/j.sysarc.2007.01.004

    [74]

    Mariani G, Palermo G, Silvano C, et al. Multi-processor system-on-chip design space exploration based on multi-level modeling techniques[C]//Proc of the 9th Int Conf on Embedded Computer Systems: Architectures, Modeling and Simulation. Piscataway, NJ: IEEE, 2009: 118–124

    [75]

    Mariani G, Palermo G, Zaccaria V, et al. OSCAR: An optimization methodology exploiting spatial correlation in multicore design spaces[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2012, 31(5): 740−753 doi: 10.1109/TCAD.2011.2177457

    [76]

    Burger D, Austin T M. The SimpleScalar tool set, version 2.0[J]. ACM SIGARCH Computer Architecture News, 1997, 25(3): 13−25 doi: 10.1145/268806.268810

    [77]

    Renau J, Fraguela B, Tuck J, et al. SESC simulator[EB/OL]. 2005[2023-12-01]. http://sesc.sourceforge.net

    [78]

    Binkert N, Beckmann B, Black G, et al. The gem5 simulator[J]. ACM SIGARCH Computer Architecture News, 2011, 39(2): 1−7 doi: 10.1145/2024716.2024718

    [79]

    Chiou D, Sunwoo D, Kim J, et al. FPGA-accelerated simulation technologies (FAST): Fast, full-system, cycle-accurate simulators[C]//Proc of the 40th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2007: 249–261

    [80]

    Chung E S, Nurvitadhi E, Hoe J C, et al. A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs[C]//Proc of the 16th Int ACM/SIGDA Symp on Field Programmable Gate Arrays. New York: ACM, 2008: 77–86

    [81]

    Chung E S, Papamichael M K, Nurvitadhi E, et al. ProtoFlex: Towards scalable, full-system multiprocessor simulations using FPGAs[J]. ACM Transactions on Reconfigurable Technology and Systems, 2009, 2(2): 15: 1–15: 32

    [82]

    Tan Zhangxi, Waterman A, Avizienis R, et al. RAMP Gold: An FPGA-based architecture simulator for multiprocessors[C]//Proc of the 47th Design Automation Conf. New York: ACM, 2010: 463–468

    [83]

    Pellauer M, Adler M, Kinsy M, et al. HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing[C]//Proc of the 17th Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2011: 406–417

    [84]

    Karandikar S, Mao H, Kim D, et al. FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud[C]//Proc of the 45th Annual Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2018: 29–42

    [85]

    Balkind J, McKeown M, Fu Yaosheng, et al. OpenPiton: An open source manycore research framework[C]//Proc of the 21st Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2016: 217–232

    [86]

    Wang Shenghong, Possignolo R T, Skinner H B, et al. LiveHD: A productive live hardware development flow[J]. IEEE Micro, 2020, 40(4): 67−75 doi: 10.1109/MM.2020.2996508

    [87]

    Petrisko D, Gilani F, Wyse M, et al. BlackParrot: An agile open-source RISC-V multicore for accelerator SoCs[J]. IEEE Micro, 2020, 40(4): 93−102 doi: 10.1109/MM.2020.2996145

    [88]

    Zhang Sizhuo, Wright A, Bourgeat T, et al. Composable building blocks to open up processor design[C]//Proc of the 51st Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2018: 68–81

    [89]

    Amid A, Biancolin D, Gonzalez A, et al. Chipyard: Integrated design, simulation, and implementation framework for custom SoCs[J]. IEEE Micro, 2020, 40(4): 10−21 doi: 10.1109/MM.2020.2996616

    [90]

    Joseph P J, Vaswani K, Thazhuthaveetil M J. Construction and use of linear regression models for processor performance analysis[C]//Proc of the 12th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2006: 99–108

    [91]

    Agarwal N, Jain T, Zahran M. Performance prediction for multi-threaded applications[C]//Proc of the 2nd Int Workshop on AI-assisted Design for Architecture. New York: ACM, 2019: 71−76

    [92]

    Joseph P J, Vaswani K, Thazhuthaveetil M J. A predictive performance model for superscalar processors[C]//Proc of the 39th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2006: 161–170

    [93]

    Cho C B, Zhang Wangyuan, Li Tao. Informed microarchitecture design space exploration using workload dynamics[C]//Proc of the 40th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2007: 274–285

    [94]

    Ould-Ahmed-Vall E, Woodlee J, Yount C, et al. Using model trees for computer architecture performance analysis of software applications[C]//Proc of the 2007 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2007: 116–125

    [95]

    Powell A, Bouganis C-S, Cheung P Y K. High-level power and performance estimation of FPGA-based soft processors and its application to design space exploration[J]. Journal of Systems Architecture, 2013, 59(10): 1144−1156 doi: 10.1016/j.sysarc.2013.08.003

    [96]

    Mankodi A, Bhatt A, Chaudhury B. Predicting physical computer systems performance and power from simulation systems using machine learning model[J]. Computing, 2022, 105(5): 1−19

    [97]

    Li Dandan, Yao Shuzhen, Wang Ying. Processor design space exploration via statistical sampling and semi-supervised ensemble learning[J]. IEEE Access, 2018, 6: 25495−25505 doi: 10.1109/ACCESS.2018.2831079

    [98]

    Guo Qi, Chen Tianshi, Chen Yunji, et al. Effective and efficient microprocessor design space exploration using unlabeled design configurations[C]//Proc of the 22nd Int Joint Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2011: 1671–1677

    [99]

    Hoste K, Phansalkar A, Eeckhout L, et al. Performance prediction based on inherent program similarity[C]//Proc of the 15th Int Conf on Parallel Architectures and Compilation Techniques. New York: ACM, 2006: 114–122

    [100]

    Guo Qi, Chen Tianshi, Chen Yunji, et al. Microarchitectural design space exploration made fast[J]. Microprocessors and Microsystems, 2013, 37(1): 41−51 doi: 10.1016/j.micpro.2012.07.006

    [101]

    Ahmadinejad H, Fatemi O. Moving towards grey-box predictive models at micro-architecture level by investigating inherent program characteristics[J]. IET Computers & Digital Techniques, 2018, 12(2): 53−61 doi: 10.1049/iet-cdt.2016.0148

    [102]

    Taha T M, Wills S. An instruction throughput model of superscalar processors[J]. IEEE Transactions on Computers, 2008, 57(3): 389−403 doi: 10.1109/TC.2007.70817

    [103]

    Xu Chi, Chen Xi, Dick R P, et al. Cache contention and application performance prediction for multi-core systems[C]//Proc of the 2010 IEEE Int Symp on Performance Analysis of Systems & Software. Los Alamitos, CA: IEEE Computer Society, 2010: 76–86

    [104]

    Eyerman S, Eeckhout L, Karkhanis T, et al. A mechanistic performance model for superscalar out-of-order processors[J]. ACM Transactions on Computer Systems, 2009, 27(2): 3: 1–3: 37

    [105]

    Breughe M B, Eyerman S, Eeckhout L. Mechanistic analytical modeling of superscalar in-order processor performance[J]. ACM Transactions on Architecture and Code Optimization, 2015, 11(4): 50: 1–50: 26

    [106]

    Carlson T E, Heirman W, Eeckhout L. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation[C]//Proc of the 2011 Conf on High Performance Computing Networking, Storage and Analysis. New York: ACM, 2011: 52: 1−52: 12

    [107]

    Wang Lei, Tang Yuxing, Deng Yu, et al. A scalable and fast microprocessor design space exploration methodology[C]//Proc of the 9th Int Symp on Embedded Multicore/Many-core Systems-on-Chip. Los Alamitos, CA: IEEE Computer Society, 2015: 33–40

    [108]

    Lee J, Jang H, Kim J. RpStacks: Fast and accurate processor design space exploration using representative stall-event stacks[C]//Proc of the 47th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2014: 255–267

    [109]

    Jang H, Jo J E, Lee J, et al. RpStacks-MT: A high-throughput design evaluation methodology for multi-core processors[C]//Proc of the 51st Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2018: 586–599

    [110]

    Noonburg D B, Shen J P. A framework for statistical modeling of superscalar processor performance[C]//Proc of the 3rd Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 1997: 298–309

    [111]

    Chen X E, Aamodt T M. A first-order fine-grained multithreaded throughput model[C]//Proc of the 15th Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2009: 329–340

    [112]

    Liang Y, Mitra T. An analytical approach for fast and accurate design space exploration of instruction caches[J]. ACM Transactions on Embedded Computing Systems, 2013, 13(3): 43: 1−43: 29

    [113]

    Hartstein A, Puzak T R. The optimum pipeline depth for a microprocessor[C]//Proc of the 29th Annual Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2002: 7–13

    [114]

    Chen X E, Aamodt T M. Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs[J]. ACM Transactions on Architecture and Code Optimization, 2011, 8(3): 59−70

    [115]

    Li L, Pandey S, Flynn T, et al. SimNet: Accurate and high-performance computer architecture simulation using deep learning[C]//Proc of the 2022 ACM SIGMETRICS/IFIP Performance Joint Int Conf on Measurement and Modeling of Computer Systems. New York: ACM, 2022: 67–68

    [116]

    Panda R, John L K. Proxy benchmarks for emerging big-data workloads[C]//Proc of the 26th Int Conf on Parallel Architectures and Compilation Techniques. Los Alamitos, CA: IEEE Computer Society, 2017: 105–116

    [117]

    Kang S, Kumar R. Magellan: A search and machine learning-based framework for fast multi-core design space exploration and optimization[C]//Proc of the 2008 Design, Automation and Test in Europe. New York: ACM, 2008: 1432–1437

    [118]

    Guo Qi, Chen Tianshi, Zhou Zhihua, et al. Robust design space modeling[J]. ACM Transactions on Design Automation of Electronic Systems, 2015, 20(2): 18: 1–18: 22

    [119] 张乾龙,侯锐,杨思博,等. 体系结构模拟器在处理器设计过程中的作用[J]. 计算机研究与发展,2019,56(12):2702−2719 doi: 10.7544/issn1000-1239.2019.20190044

    Zhang Qianlong, Hou Rui, Yang Sibo, et al. The role of architecture simulators in the process of CPU design[J]. Journal of Computer Research and Development, 2019, 56(12): 2702−2719 (in Chinese) doi: 10.7544/issn1000-1239.2019.20190044

    [120]

    Hoste K, Eeckhout L. Microarchitecture-independent workload characterization[J]. IEEE Micro, 2007, 27(3): 63−72 doi: 10.1109/MM.2007.56

    [121]

    Jin Zhanpeng, Cheng A C. Evolutionary benchmark subsetting[J]. IEEE Micro, 2008, 28(6): 20−36 doi: 10.1109/MM.2008.87

    [122]

    Jin Zhanpeng, Cheng A C. SubsetTrio: An evolutionary, geometric, and statistical benchmark subsetting framework[J]. ACM Transactions on Modeling and Computer Simulation, 2011, 21(3): 21: 1–21: 23

    [123]

    Jin Zhanpeng, Cheng A C. Improve simulation efficiency using statistical benchmark subsetting: An ImplantBench case study[C]//Proc of the 45th Annual Design Automation Conf. New York: ACM, 2008: 970–973

    [124]

    Lee C, Potkonjak M, Mangione-Smith W H. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems[C]//Proc of the 30th Annual Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 1997: 330–335

    [125]

    Guthaus M R, Ringenberg J S, Ernst D, et al. MiBench: A free, commercially representative embedded benchmark suite[C]//Proc of the 4th Annual IEEE Int Workshop on Workload Characterization. Piscataway, NJ: IEEE, 2001: 3–14

    [126]

    Standard Performance Evaluation Corporation. SPEC CPU2000[EB/OL]. (2007-06-07)[2023-12-01]. https://www.spec.org/cpu2000

    [127]

    Standard Performance Evaluation Corporation. SPEC CPU2006[EB/OL]. (2023-01-06)[2023-12-01]. https://www.spec.org/cpu2006

    [128]

    Bienia C, Kumar S, Singh J P, et al. The PARSEC benchmark suite: Characterization and architectural implications[C]//Proc of the 17th Int Conf on Parallel Architectures and Compilation Techniques. New York: ACM, 2008: 72–81

    [129]

    Woo S C, Ohara M, Torrie E, et al. The SPLASH-2 programs: Characterization and methodological considerations[C]//Proc of the 22nd Annual Int Symp on Computer Architecture. New York: ACM, 1995: 24–36

    [130]

    Chandra D, Guo Fei, Kim S, et al. Predicting inter-thread cache contention on a chip multi-processor architecture[C]//Proc of the 11th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2005: 340–351

    [131]

    Hsu W C, Chen H, Yew P C, et al. On the predictability of program behavior using different input data sets[C]//Proc of the 6th Annual Workshop on Interaction between Compilers and Computer Architectures. Los Alamitos, CA: IEEE Computer Society, 2002: 45–53

    [132]

    Hoste K, Eeckhout L. Comparing benchmarks using key microarchitecture-independent characteristics[C]//Proc of the 2nd IEEE Int Symp on Workload Characterization. Los Alamitos, CA: IEEE Computer Society, 2006: 83–92

    [133]

    Yi J J, Sendag R, Eeckhout L, et al. Evaluating benchmark subsetting approaches[C]//Proc of the 2nd IEEE Int Symp on Workload Characterization. Los Alamitos, CA: IEEE Computer Society, 2006: 93–104

    [134]

    Conte T M, Hirsch M A, Menezes K N. Reducing state loss for effective trace sampling of superscalar processors[C]//Proc of the 14th Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 1996: 468–477

    [135]

    Patil H, Cohn R, Charney M, et al. Pinpointing representative portions of large Intel® Itanium® programs with dynamic instrumentation[C]//Proc of the 37th Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2004: 81–92

    [136]

    Nair A A, John L K. Simulation points for SPEC CPU 2006[C]//Proc of the 26th Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 2008: 397–403

    [137]

    Lau J, Perelman E, Calder B. Selecting software phase markers with code structure analysis[C]//Proc of the 4th Int Symp on Code Generation and Optimization. Los Alamitos, CA: IEEE Computer Society, 2006: 135–146

    [138]

    Lahiri K, Kunnoth S. Fast IPC estimation for performance projections using proxy suites and decision trees[C]//Proc of the 2017 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2017: 77–86

    [139]

    Carlson T E, Heirman W, Eeckhout L. Sampled simulation of multi-threaded applications[C]//Proc of the 2013 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2013: 2–12

    [140]

    Patil H, Pereira C, Stallcup M, et al. PinPlay: A framework for deterministic replay and reproducible analysis of parallel programs[C]//Proc of the 8th Annual IEEE/ACM Int Symp on Code Generation and Optimization. New York: ACM, 2010: 2–11

    [141]

    Patil H, Isaev A, Heirman W, et al. ELFies: Executable region checkpoints for performance analysis and simulation[C]//Proc of the 19th IEEE/ACM Int Symp on Code Generation and Optimization. Piscataway, NJ: IEEE, 2021: 126–136

    [142]

    Wenisch T F, Wunderlich R E, Falsafi B, et al. TurboSMARTS: Accurate microarchitecture simulation sampling in minutes[J]. ACM SIGMETRICS Performance Evaluation Review, 2005, 33(1): 408−409 doi: 10.1145/1071690.1064278

    [143]

    Khan T M, Pérez D G, Temam O. Transparent sampling[C]//Proc of the 10th Int Conf on Embedded Computer Systems: Architectures, Modeling and Simulation. Piscataway, NJ: IEEE, 2010: 28–36

    [144]

    Eeckhout L, Luo Yue, De Bosschere K, et al. BLRL: Accurate and efficient warmup for sampled processor simulation[J]. The Computer Journal, 2005, 48(4): 451−459 doi: 10.1093/comjnl/bxh103

    [145]

    Haskins J W, Skadron K. Accelerated warmup for sampled microarchitecture simulation[J]. ACM Transactions on Architecture and Code Optimization, 2005, 2(1): 78−108 doi: 10.1145/1061267.1061272

    [146]

    Van Ertvelde L, Hellebaut F, Eeckhout L. Accurate and efficient cache warmup for sampled processor simulation through NSL-BLRL[J]. The Computer Journal, 2008, 51(2): 192−206

    [147]

    Jiang Chuntao, Yu Zhibin, Jin Hai, et al. Shorter on-line warmup for sampled simulation of multi-threaded applications[C]//Proc of the 44th Int Conf on Parallel Processing. Los Alamitos, CA: IEEE Computer Society, 2015: 350–359

    [148]

    Bell R, Eeckhout L, John L, et al. Deconstructing and improving statistical simulation in HLS[C]//Proc of the 2004 Workshop on Duplicating, Deconstructing and Debunking held in Conjunction with the 31st Annual Int Symp on Computer Architecture. New York: ACM, 2004: 2−12

    [149]

    Joshi A, Yi J J, Bell R H, et al. Evaluating the efficacy of statistical simulation for design space exploration[C]//Proc of the 2006 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2006: 70–79

    [150]

    Eeckhout L, Bell R H, Stougie B, et al. Control flow modeling in statistical simulation for accurate and efficient processor design studies[C]//Proc of the 31st Annual Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2004: 350–361

    [151]

    Bell R H, Bhatia R R, John L K, et al. Automatic testcase synthesis and performance model validation for high performance PowerPC processors[C]//Proc of the 2006 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2006: 154–165

    [152]

    Lee H R, Sánchez D. Datamime: Generating representative benchmarks by automatically synthesizing datasets[C]//Proc of the 55th IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2022: 1144–1159

    [153]

    Joshi A, Eeckhout L, Bell R H, et al. Performance cloning: A technique for disseminating proprietary applications as benchmarks[C]//Proc of the 2nd IEEE Int Symp on Workload Characterization. Los Alamitos, CA: IEEE Computer Society, 2006: 105–115

    [154]

    Joshi A M, Eeckhout L, John L K, et al. Automated microprocessor stressmark generation[C]//Proc of the 14th Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2008: 229–239

    [155]

    Joshi A, Eeckhout L, Bell R H, et al. Distilling the essence of proprietary workloads into miniature benchmarks[J]. ACM Transactions on Architecture and Code Optimization, 2008, 5(2): 10: 1–10: 33

    [156]

    Ganesan K, John L K. Automatic generation of miniaturized synthetic proxies for target applications to efficiently design multicore processors[J]. IEEE Transactions on Computers, 2014, 63(4): 833−846 doi: 10.1109/TC.2013.36

    [157]

    Deniz E, Sen A, Kahne B, et al. MINIME: Pattern-aware multicore benchmark synthesizer[J]. IEEE Transactions on Computers, 2015, 64(8): 2239−2252 doi: 10.1109/TC.2014.2349522

    [158]

    Lee K, Evans S, Cho S. Accurately approximating superscalar processor performance from traces[C]//Proc of the 2009 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2009: 238–248

    [159]

    Lee K, Cho S. In-N-Out: Reproducing out-of-order superscalar processor behavior from reduced in-order traces[C]//Proc of the 19th Annual Int Symp on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems. Los Alamitos, CA: IEEE Computer Society, 2011: 126–135

    [160]

    Lee K, Cho S. Accurately modeling superscalar processor performance with reduced trace[J]. Journal of Parallel and Distributed Computing, 2013, 73(4): 509−521 doi: 10.1016/j.jpdc.2012.12.002

    [161]

    Ganesan K, Jo J, John L K. Synthesizing memory-level parallelism aware miniature clones for SPEC CPU2006 and ImplantBench workloads[C]//Proc of the 2010 IEEE Int Symp on Performance Analysis of Systems & Software. Los Alamitos, CA: IEEE Computer Society, 2010: 33–44

    [162]

    Panda R, Zheng Xinnian, John L K. Accurate address streams for LLC and beyond (SLAB): A methodology to enable system exploration[C]//Proc of the 2017 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2017: 87–96

    [163]

    Van Biesbrouck M, Sherwood T, Calder B. A co-phase matrix to guide simultaneous multithreading simulation[C]//Proc of the 2004 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2004: 45–56

    [164]

    Yi J J, Kodakara S V, Sendag R, et al. Characterizing and comparing prevailing simulation techniques[C]//Proc of the 11th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2005: 266–277

    [165]

    Tairum Cruz M, Bischoff S, Rusitoru R. Shifting the barrier: Extending the boundaries of the BarrierPoint methodology[C]//Proc of the 2018 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2018: 120–122

    [166]

    Bell R H, John L K. Efficient power analysis using synthetic testcases[C]//Proc of the 1st IEEE Int Symp Workload Characterization. Piscataway, NJ: IEEE, 2005: 110–118

    [167]

    Penry D A, Fay D, Hodgdon D, et al. Exploiting parallelism and structure to accelerate the simulation of chip multi-processors[C]//Proc of the 12th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2006: 29–40

    [168]

    Mariani G, Palermo G, Zaccaria V, et al. DeSpErate: Speeding-up design space exploration by using predictive simulation scheduling[C/OL]//Proc of the 17th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2014[2023-12-18]. https://ieeexplore.ieee.org/document/6800432?arnumber=6800432

    [169]

    Li Bin, Peng Lu, Ramadass B. Accurate and efficient processor performance prediction via regression tree based modeling[J]. Journal of Systems Architecture, 2009, 55(10): 457−467

    [170]

    Pang Jiufeng, Li Xiafeng, Xie Jinsong, et al. Microarchitectural design space exploration via support vector machine[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2010, 46(1): 55−63

    [171]

    Cook H, Skadron K. Predictive design space exploration using genetically programmed response surfaces[C]//Proc of the 45th Annual Design Automation Conf. New York: ACM, 2008: 960–965

    [172]

    Zhai Jianwang, Bai Chen, Zhu Binwu, et al. McPAT-Calib: A microarchitecture power modeling framework for modern CPUs[C/OL]//Proc of the 40th IEEE/ACM Int Conf on Computer Aided Design. Piscataway, NJ: IEEE, 2021[2023-12-18]. https://ieeexplore.ieee.org/document/9643508

    [173]

    Zhai Jianwang, Bai Chen, Zhu Binwu, et al. McPAT-Calib: A RISC-V BOOM microarchitecture power modeling framework[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(1): 243−256 doi: 10.1109/TCAD.2022.3169464

    [174]

    Givargis T, Vahid F, Henkel J. System-level exploration for Pareto-optimal configurations in parameterized systems-on-a-chip[C]//Proc of the 20th IEEE/ACM Int Conf on Computer Aided Design. Los Alamitos, CA: IEEE Computer Society, 2001: 25–30

    [175]

    Yazdani R, Sheidaeian H, Salehi M E. A fast design space exploration for VLIW architectures[C]//Proc of the 22nd Iranian Conf on Electrical Engineering. Piscataway, NJ: IEEE, 2014: 856–861

    [176]

    Kansakar P, Munir A. A design space exploration methodology for parameter optimization in multicore processors[J]. IEEE Transactions on Parallel and Distributed Systems, 2018, 29(1): 2−15 doi: 10.1109/TPDS.2017.2745580

    [177]

    Ascia G, Catania V, Di Nuovo A G, et al. Performance evaluation of efficient multi-objective evolutionary algorithms for design space exploration of embedded computer systems[J]. Applied Soft Computing, 2011, 11(1): 382−398 doi: 10.1016/j.asoc.2009.11.029

    [178]

    Mariani G, Palermo G, Silvano C, et al. An efficient design space exploration methodology for multi-cluster VLIW architectures based on artificial neural networks[C]//Proc of the 16th IFIP/IEEE Int Conf on Very Large Scale Integration. Piscataway, NJ: IEEE, 2008: 13−15

    [179]

    Zaccaria V, Palermo G, Castro F, et al. MULTICUBE Explorer: An open source framework for design space exploration of chip multi-processors[C]//Proc of the 23rd Int Conf on Architecture of Computing Systems. Hannover, Germany: VDE Verlag, 2010: 325–331

    [180]

    Mariani G, Brankovic A, Palermo G, et al. A correlation-based design space exploration methodology for multi-processor systems-on-chip[C]//Proc of the 47th Design Automation Conf. New York: ACM, 2010: 120–125

    [181]

    Wang Duo, Yan Mingyu, Liu Xin, et al. A high-accurate multi-objective exploration framework for design space of CPU[C/OL]//Proc of the 60th ACM/IEEE Design Automation Conf. Piscataway, NJ: IEEE, 2023[2023-12-18]. https://ieeexplore.ieee.org/document/10247790

    [182]

    Wang Duo, Yan Mingyu, Teng Yihan, et al. A high-accurate multi-objective ensemble exploration framework for design space of CPU microarchitecture[C]//Proc of the 33rd Great Lakes Symp on VLSI 2023. New York: ACM, 2023: 379–383

    [183]

    Beltrame G, Fossati L, Sciuto D. Decision-theoretic design space exploration of multiprocessor platforms[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2010, 29(7): 1083−1095

    [184]

    Beltrame G, Nicolescu G. A multi-objective decision-theoretic exploration algorithm for platform-based design[C]//Proc of the 14th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2011: 1192−1195

    [185]

    Sheldon D, Vahid F, Lonardi S. Soft-core processor customization using the design of experiments paradigm[C]//Proc of the 10th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2007: 821−826

    [186]

    Mariani G, Palermo G, Silvano C, et al. Meta-model assisted optimization for design space exploration of multi-processor systems-on-chip[C]//Proc of the 12th Euromicro Conf on Digital System Design, Architectures, Methods and Tools. Los Alamitos, CA: IEEE Computer Society, 2009: 383–389

    [187]

    Palermo G, Silvano C, Zaccaria V. Multi-objective design space exploration of embedded systems[J]. Journal of Embedded Computing, 2005, 1(3): 305−316

    [188]

    Wu Nan, Xie Yuan, Hao Cong. IronMan: GNN-assisted design space exploration in high-level synthesis via reinforcement learning[C]//Proc of the 31st Great Lakes Symp on VLSI. New York: ACM, 2021: 39–44

    [189]

    Wu Nan, Xie Yuan, Hao Cong. IronMan-Pro: Multiobjective design space exploration in HLS via reinforcement learning and graph neural network-based modeling[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(3): 900−913 doi: 10.1109/TCAD.2022.3185540

    [190]

    Kao S C, Jeong G, Krishna T. ConfuciuX: Autonomous hardware resource assignment for DNN accelerators using reinforcement learning[C]//Proc of the 53rd Annual IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2020: 622–636

    [191]

    Feng Lang, Liu Wenjian, Guo Chuliang, et al. GANDSE: Generative adversarial network based design space exploration for neural network accelerator design[J]. ACM Transactions on Design Automation of Electronic Systems, 2023, 28(3): 35: 1−35: 20

    [192]

    Akram A, Sawalha L. A survey of computer architecture simulation techniques and tools[J]. IEEE Access, 2019, 7: 78120−78145 doi: 10.1109/ACCESS.2019.2917698

    [193]

    Manjikian N. Multiprocessor enhancements of the SimpleScalar tool set[J]. SIGARCH Computer Architecture News, 2001, 29(1): 8−15 doi: 10.1145/373574.373578

    [194]

    Qureshi Y M, Simon W A, Zapater M, et al. Gem5-X: A many-core heterogeneous simulation platform for architectural exploration and optimization[J]. ACM Transactions on Architecture and Code Optimization, 2021, 18(4): 44: 1–44: 27

    [195]

    Carlson T E, Heirman W, Eyerman S, et al. An evaluation of high-level mechanistic core models[J]. ACM Transactions on Architecture and Code Optimization, 2014, 11(3): 28: 1–28: 25

    [196]

    Tan Zhangxi, Waterman A, Cook H, et al. A case for FAME: FPGA architecture model execution[C]//Proc of the 37th Annual Int Symp on Computer Architecture. New York: ACM, 2010: 290–301

    [197]

    Lee Y, Waterman A, Cook H, et al. An agile approach to building RISC-V microprocessors[J]. IEEE Micro, 2016, 36(2): 8−20 doi: 10.1109/MM.2016.11

    [198]

    Di Biagio A, Davis M. llvm-mca: A static performance analysis tool[EB/OL]. (2018−03−01)[2023-12-01]. https://lists.llvm.org/pipermail/llvm-dev/2018-March/121490.html

    [199]

    Mendis C, Renda A, Amarasinghe D S, et al. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks[C]//Proc of the 36th Int Conf on Machine Learning. New York: PMLR, 2019: 4505–4515

    [200]

    Blocklove J, Garg S, Karri R, et al. Chip-Chat: Challenges and opportunities in conversational hardware design[C/OL]//Proc of the 5th ACM/IEEE Workshop on Machine Learning for CAD. Piscataway, NJ: IEEE, 2023[2023-12-18]. https://ieeexplore.ieee.org/document/10299874

    [201]

    Chang Kaiyan, Wang Ying, Ren Haimeng, et al. ChipGPT: How far are we from natural language hardware design[J]. arXiv preprint, arXiv: 2305.14019, 2023

    [202]

    Lu Yao, Liu Shang, Zhang Qijun, et al. RTLLM: An open-source benchmark for design RTL generation with large language model[J]. arXiv preprint, arXiv: 2308.05345, 2023

    [203]

    Balkind J, Chang Tingjung, Jackson P J, et al. OpenPiton at 5: A nexus for open and agile hardware design[J]. IEEE Micro, 2020, 40(4): 22−31 doi: 10.1109/MM.2020.2997706

    [204]

    Bachrach J, Vo H, Richards B, et al. Chisel: Constructing hardware in a Scala embedded language[C]//Proc of the 49th Annual Design Automation Conf. New York: ACM, 2012: 1216–1225

    [205]

    Patel H D, Shukla S K. Tackling an abstraction gap: Co-simulating SystemC DE with Bluespec ESL[C]//Proc of the 10th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2007: 279−284

    [206]

    Bourgeat T, Pit-Claudel C, Chlipala A, et al. The essence of Bluespec: A core language for rule-based hardware design[C]//Proc of the 41st ACM SIGPLAN Conf on Programming Language Design and Implementation. New York: ACM, 2020: 243–257

    [207]

    Käyrä M, Hämäläinen T D. A survey on system-on-a-chip design using Chisel HW construction language[C/OL]//Proc of the 47th Annual Conf of the IEEE Industrial Electronics Society. Piscataway, NJ: IEEE, 2021[2023-12-18]. https://ieeexplore.ieee.org/document/9589614

    [208] 王凯帆，徐易难，余子濠，等. 香山开源高性能RISC-V处理器设计与实现[J]. 计算机研究与发展，2023，60(3)：476−493 doi: 10.7544/issn1000-1239.202221036

    Wang Kaifan, Xu Yinan, Yu Zihao, et al. XiangShan open-source high performance RISC-V processor design and implementation[J]. Journal of Computer Research and Development, 2023, 60(3): 476−493(in Chinese) doi: 10.7544/issn1000-1239.202221036

    [209]

    Lee B C, Brooks D M, de Supinski B R, et al. Methods of inference and learning for performance modeling of parallel applications[C]//Proc of the 12th ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2007: 249–258

    [210]

    Hallschmid P, Saleh R. Fast design space exploration using local regression modeling with application to ASIPs[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2008, 27(3): 508−515 doi: 10.1109/TCAD.2008.915532

    [211]

    Zhang Changshu, Ravindran A, Datta K, et al. A machine learning approach to modeling power and performance of chip multiprocessors[C]//Proc of the 29th Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 2011: 45–50

    [212]

    Beg A, Prasad P W C, Singh A K, et al. A neural model for processor-throughput using hardware parameters and software’s dynamic behavior[C]//Proc of the 12th Int Conf on Intelligent Systems Design and Applications. Piscataway, NJ: IEEE, 2012: 821–825

    [213]

    Paone E, Vahabi N, Zaccaria V, et al. Improving simulation speed and accuracy for many-core embedded platforms with ensemble models[C]//Proc of the 16th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2013: 671–676

    [214]

    Castillo P A, Mora A M, Guervós J J M, et al. Architecture performance prediction using evolutionary artificial neural networks[C]//Proc of the Applications of Evolutionary Computing. Berlin: Springer, 2008: 175–183

    [215]

    Khan S, Xekalakis P, Cavazos J, et al. Using predictive modeling for cross-program design space exploration in multicore systems[C]//Proc of the 16th Int Conf on Parallel Architecture and Compilation Techniques. Los Alamitos, CA: IEEE Computer Society, 2007: 327–338

    [216]

    Dubach C, Jones T M, O’Boyle M F P. Rapid early-stage microarchitecture design using predictive models[C]//Proc of the 27th Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 2009: 297–304

    [217]

    Özisikyilmaz B, Memik G, Choudhary A N. Machine learning models to predict performance of computer system design alternatives[C]//Proc of the 37th Int Conf on Parallel Processing. Los Alamitos, CA: IEEE Computer Society, 2008: 495–502

    [218]

    Özisikyilmaz B, Memik G, Choudhary A N. Efficient system design space exploration using machine learning techniques[C]//Proc of the 45th Design Automation Conf. New York: ACM, 2008: 966–969

    [219]

    Ghosh A, Givargis T. Cache optimization for embedded processor cores: An analytical approach[J]. ACM Transactions on Design Automation of Electronic Systems, 2004, 9(4): 419−440 doi: 10.1145/1027084.1027086

    [220]

    Li Sheng, Chen Ke, Ahn J H, et al. CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques[C]//Proc of the 30th Int Conf on Computer-Aided Design. Los Alamitos, CA: IEEE Computer Society, 2011: 694–701

    [221]

    Li Sheng, Ahn J H, Strong R D, et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures[C]//Proc of the 42nd Annual IEEE/ACM Int Symp on Microarchitecture. New York: ACM, 2009: 469–480

    [222]

    Karkhanis T S, Smith J E. A first-order superscalar processor model[C]//Proc of the 31st Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2004: 338–349

    [223]

    Genbrugge D, Eyerman S, Eeckhout L. Interval simulation: Raising the level of abstraction in architectural simulation[C/OL]//Proc of the 16th Int Symp on High-Performance Computer Architecture. Piscataway, NJ: IEEE, 2010[2023-12-18]. https://ieeexplore.ieee.org/document/5416636

    [224]

    Breughe M, Eyerman S, Eeckhout L. A mechanistic performance model for superscalar in-order processors[C]//Proc of the 2012 IEEE Int Symp on Performance Analysis of Systems & Software. Los Alamitos, CA: IEEE Computer Society, 2012: 14–24

    [225]

    Van den Steen S, Eyerman S, De Pestel S, et al. Analytical processor performance and power modeling using micro-architecture independent characteristics[J]. IEEE Transactions on Computers, 2016, 65(12): 3537−3551

    [226]

    De Pestel S, Van den Steen S, Akram S, et al. RPPM: Rapid performance prediction of multithreaded workloads on multicore processors[C]//Proc of the 2019 IEEE Int Symp on Performance Analysis of Systems and Software. Piscataway, NJ: IEEE, 2019: 257–267

    [227]

    Jongerius R, Mariani G, Anghel A, et al. Analytic processor model for fast design-space exploration[C]//Proc of the 33rd IEEE Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 2015: 411–414

    [228]

    Jongerius R, Anghel A, Dittmann G, et al. Analytic multi-core processor model for fast design-space exploration[J]. IEEE Transactions on Computers, 2018, 67(6): 755−770 doi: 10.1109/TC.2017.2780239

    [229]

    Kwon J, Carloni L P. Transfer learning for design-space exploration with high-level synthesis[C]//Proc of the 2nd ACM/IEEE Workshop on Machine Learning for CAD. New York: ACM, 2020: 163–168

    [230]

    Zhang Zheng, Chen Tinghuan, Huang Jiaxin, et al. A fast parameter tuning framework via transfer learning and multi-objective Bayesian optimization[C]//Proc of the 59th ACM/IEEE Design Automation Conf. New York: ACM, 2022: 133–138

    [231]

    Zhang Keyi, Asgar Z, Horowitz M. Bringing source-level debugging frameworks to hardware generators[C]//Proc of the 59th ACM/IEEE Design Automation Conf. New York: ACM, 2022: 1171–1176

    [232]

    Xiao Qingcheng, Zheng Size, Wu Bingzhe, et al. HASCO: Towards agile hardware and software co-design for tensor computation[C]//Proc of the 48th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2021: 1055–1068

    [233]

    Esmaeilzadeh H, Ghodrati S, Kahng A B, et al. Physically accurate learning-based performance prediction of hardware-accelerated ML algorithms[C]//Proc of the 4th ACM/IEEE Workshop on Machine Learning for CAD. New York: ACM, 2022: 119–126

    [234]

    Sun Qi, Chen Tinghuan, Liu Siting, et al. Correlated multi-objective multi-fidelity optimization for HLS directives design[C]//Proc of the 24th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2021: 46–51

    [235]

    Wu Y N, Tsai P A, Parashar A, et al. Sparseloop: An analytical approach to sparse tensor accelerator modeling[C]//Proc of the 55th IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2022: 1377–1395

    [236]

    Huang Qijing, Kang M, Dinh G, et al. CoSA: Scheduling by constrained optimization for spatial accelerators[C]//Proc of the 48th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2021: 554–566

    [237]

    Mei Linyan, Houshmand P, Jain V, et al. ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators[J]. IEEE Transactions on Computers, 2021, 70(8): 1160−1174 doi: 10.1109/TC.2021.3059962

Publication history
  • Received: 2023-05-06
  • Revised: 2024-01-14
  • Available online: 2024-11-12
  • Published in issue: 2024-12-31
