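HyWarm: Adaptive Hybrid Warmup Method for RTL Emulation of Processors

Zhou Yaoyang; Han Boyang; Lin Jiawei; Wang Kaifan; Zhang Linjuan; Yu Zihao; Tang Dan; Wang Sa; Sun Ninghui; Bao Yungang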

doi:10.7544/issn1000-1239.202330061

Corresponding author:
Bao Yungang (baoyg@ict.ac.cn)

CLC number: TP391
Publication history
- Received: 2023-01-09
- Revised: 2023-04-14
- Published online: 2023-05-03
- Issue date: 2023-05-31

HyWarm: Adaptive Hybrid Warmup Method for RTL Emulation of Processors

Zhou Yaoyang^{1, 2, 3},
Han Boyang^{4},
Lin Jiawei^{1, 2, 3},
Wang Kaifan^{1, 2, 3},
Zhang Linjuan^{1, 2, 3},
Yu Zihao^{1},
Tang Dan^{1, 3},
Wang Sa^{1, 2},
Sun Ninghui^{1, 2, 3},
Bao Yungang^{1, 2}

1. State Key Lab of Processors (Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190
2. School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100049
3. Beijing Institute of Open Source Chip, Beijing 100080
4. Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong 999077

Funds: This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC05030200) and the Major Program of the National Natural Science Foundation of China (62090020).

Author Bio:
Zhou Yaoyang: born in 1995. PhD. His main research interests include CPU ILP enhancement, scalable CPU design, workload sampling, and performance evaluation methods.

Han Boyang: born in 1999. Master of Engineering candidate. His main research interests include computer architecture, digital system design, and high-speed serial communication protocols.

Lin Jiawei: born in 1998. Master candidate. His main research interest is high-performance computer architecture.

Wang Kaifan: born in 1997. PhD candidate. His main research interests include agile development of processors and computer architecture.

Zhang Linjuan: born in 1998. Master candidate. Her main research interest is high-performance computer architecture.

Yu Zihao: born in 1991. PhD. His main research interests include computer architecture and operating systems.

Tang Dan: born in 1976. PhD, senior engineer. His main research interests include computer architecture and low-power SoC design.

Wang Sa: born in 1986. PhD, associate professor. His main research interests include cloud computing, operating systems, and system modeling and performance analysis.

Sun Ninghui: born in 1968. PhD, academician of the Chinese Academy of Engineering, fellow of CCF. His main research interests include computer architecture and high-performance computing.

Bao Yungang: born in 1980. PhD, professor. His main research interests include data-center architecture, agile design methodology for processor chips, and the ecosystem of open-source processor chips.

Abstract:
Accurate and fast performance estimation underpins design decisions and parameter exploration in high-performance processor development. Prior work accelerates processor RTL emulation with workload sampling and architectural checkpoints for RTL, making it possible to estimate the performance of benchmarks such as SPEC CPU on complex high-performance processors within a few days. However, an iteration cycle of several days is still too long for architecture exploration, and the performance measurement cycle can be shortened further. During RTL emulation of processors, the warm-up phase accounts for a large share of the total time. The HyWarm framework is proposed to accelerate this warm-up phase. HyWarm uses a micro-architectural simulator to analyze the warm-up demand of each workload and customizes a warm-up scheme for it. For workloads with high cache warm-up demand, HyWarm functionally warms up the RTL caches through the bus protocol; for the detailed RTL emulation, HyWarm uses CPU clustering and longest-job-first (LJF) scheduling to reduce the maximum completion time. Compared with the best existing sampling-based RTL emulation method, HyWarm shortens the emulation completion time by 53% while achieving accuracy similar to that of the baseline.
Keywords: high-performance processor; chip design; agile development; workload sampling; functional warm-up

Figure 1. Existing sampling-based simulation methods

Figure 2. Emulation time distribution of 492 checkpoints from SPEC CPU® 2006

Figure 3. Optimization overview of HyWarm: the existing fixed warm-up duration is divided into three segments

Figure 4. Mainstream sampling-based simulation methods

Figure 5. Warm-up demand curve of sjeng

Figure 6. Warm-up length search process

Figure 7. Warm-up demand of branch predictors in the GEM5 simulator and the Xiangshan processor

Figure 8. Impact of enabling multi-threading in Verilator on the scheduling policy

Figure 9. Comparison of maximum completion time under different scheduling policies

Figure 10. Workflow of HyWarm

Figure 11. Workflow of Filter mode

Figure 12. Cache subsystem that receives TileLink requests

Figure 13. Distribution of warm-up demand (number of instructions) of checkpoints

Figure 14. Warm-up demand curves of the GEM5 simulator and the Xiangshan processor

Figure 15. Impact of different warm-up schemes on L1MP

Figure 16. Impact of different warm-up schemes on branch MPKI

Figure 17. Impact of different warm-up schemes on CPI

Figure 18. Distribution of total detailed simulation cycle counts for 53 workloads using adaptive warm-up

Table 1 Emulation Speed of Verilator with Different Parallelism on AMD EPYC 7H12 Server with 64 Cores

Emulation speed (IPS)	4 threads, 1 task	4 threads, 16 tasks	Performance loss at full load
Per task	2153.13	1189.31	
Per core	538.28	297.33	45%
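
The figures in Table 1 can be cross-checked with a few lines of arithmetic. The sketch below assumes that the per-core speed is the per-task speed divided by the 4 threads of a task, and that the reported loss is the relative drop in per-core speed between the single-task and the 16-task runs. Both readings are inferred from the table itself; the function and variable names are illustrative.

```python
# Values copied from Table 1 (instructions per second).
THREADS_PER_TASK = 4
PER_CORE_SINGLE_TASK = 538.28   # 4 threads, 1 task
PER_CORE_FULL_LOAD = 297.33     # 4 threads, 16 tasks (64 cores busy)


def relative_loss(unloaded: float, loaded: float) -> float:
    """Fraction of per-core speed lost when the server is fully loaded."""
    return (unloaded - loaded) / unloaded


# Assumed relation: per-task speed = per-core speed * threads per task.
per_task_single = PER_CORE_SINGLE_TASK * THREADS_PER_TASK
per_task_full = PER_CORE_FULL_LOAD * THREADS_PER_TASK

loss = relative_loss(PER_CORE_SINGLE_TASK, PER_CORE_FULL_LOAD)
print(f"per-task IPS: {per_task_single:.2f} (single), {per_task_full:.2f} (full load)")
print(f"loss at full load: {loss:.1%}")
```

Under these assumptions the computed per-task speeds agree with the table to within rounding, and the loss rounds to the reported 45%.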

Table 2 Comparison of Commonly Used RTL Performance Evaluation Methods

RTL performance evaluation method	Emulation frequency	Typical price (CNY)	Rentable	Typical design capacity
RTL software simulator	≤ 1 kHz	50,000−100,000	Yes	Commercial-grade SoCs
Public-cloud FPGA	≤ 100 MHz	240−3600 per day	Yes	BOOM processor
Private FPGA	≤ 100 MHz	≤ 400,000	No	Xiangshan processor
Hardware emulator	≤ 1 MHz	> 10,000,000	No	Commercial-grade SoCs

Table 3 Comparison of Multi-threading Scaling Efficiency of Verilator Emulation When Server Load is Low

Number of threads	1	4	8	16
Per-core IPS	190.82	538.28	450.94	321.27

Table 4 Comparison of Multi-threading Scaling Efficiency of Verilator Emulation When Server is Fully Loaded

Number of threads	4	8	16
Per-core IPS	297.33	389.27	335.50
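
Tables 3 and 4 together suggest that the most efficient thread count per Verilator task depends on server load. The sketch below simply picks the thread count with the highest per-core IPS in each table. Treating per-core IPS as the selection criterion is an assumption made here for illustration; the dictionary and function names are hypothetical.

```python
from typing import Dict

# Per-core IPS by thread count, copied from Table 3 (low load) and Table 4 (full load).
LOW_LOAD_PER_CORE_IPS: Dict[int, float] = {1: 190.82, 4: 538.28, 8: 450.94, 16: 321.27}
FULL_LOAD_PER_CORE_IPS: Dict[int, float] = {4: 297.33, 8: 389.27, 16: 335.50}


def most_efficient_threads(per_core_ips: Dict[int, float]) -> int:
    """Return the thread count whose per-core IPS is highest."""
    return max(per_core_ips, key=per_core_ips.get)


best_low = most_efficient_threads(LOW_LOAD_PER_CORE_IPS)
best_full = most_efficient_threads(FULL_LOAD_PER_CORE_IPS)
print(f"low load: {best_low} threads, full load: {best_full} threads")
```

With the tabulated values, 4 threads per task is the most efficient setting on a lightly loaded server, while 8 threads per task is the most efficient one when the server is fully loaded.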

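Table 5 Microarchitectural Configuration

Component	Configuration
Branch predictor	16KB TAGE-SC + ITTAGE + RAS + 4KB BTB
L1 data cache	128KB, 8-way
L1 instruction cache	128KB, 8-way
L2 cache	1MB, 8-way, non-inclusive
L3 cache	6MB, 6-way, non-inclusive
L1 instruction TLB	40 entries
L1 data TLB	136 entries (128 × 4K pages + 8 × 2M pages)
L2 TLB	2K entries
Fetch width	8 × 4B instructions per cycle
Decode/rename width	6 instructions per cycle
ROB/LQ/SQ	256/80/64
Physical register file	192 integer; 192 floating-point
Execution units	Int: 4×ALU, 2×MDU, 1×Misc; Mem: 2×Ld AGU, 2×St AGU; Float: 4×FMA, 2×Misc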

Table 6 Warm-up Configurations

Scheme	Functional warm-up (M instructions)	Detailed warm-up (M instructions)	Performance measurement (M instructions)
0+100	0	100	5
0+50	0	50	5
0+25	0	25	5
0+10	0	10	5
0+5	0	5	5
Ada	100−DW	Adaptive (DW)	5
FixedFW (95+5)	95	5	5
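
The Ada row of Table 6 splits a fixed 100M-instruction warm-up budget between functional and detailed warm-up according to the per-checkpoint demand DW. A minimal sketch of that split is given below. The function name, the range check, and the interpretation of DW as a value in millions of instructions are assumptions for illustration, not the framework's actual interface.

```python
from typing import Tuple

TOTAL_WARMUP_M = 100  # total warm-up budget in millions of instructions (Table 6)
MEASUREMENT_M = 5     # measured interval in millions of instructions (Table 6)


def ada_warmup_split(dw_m: float) -> Tuple[float, float]:
    """Split the warm-up budget for one checkpoint.

    dw_m is the detailed warm-up demand (DW) in millions of instructions.
    Returns (functional warm-up, detailed warm-up), both in millions.
    """
    if not 0 <= dw_m <= TOTAL_WARMUP_M:
        raise ValueError("DW must lie within the total warm-up budget")
    return TOTAL_WARMUP_M - dw_m, dw_m


functional, detailed = ada_warmup_split(20)
print(f"functional: {functional}M, detailed: {detailed}M, measured: {MEASUREMENT_M}M")
```

For a checkpoint with DW = 5, this sketch yields the same split as the FixedFW (95+5) row.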

Table 7 Comparison of Total Simulation Time for Different Functional Warm-up Schemes (h)

Benchmark	0+5	0+10	0+25	FixedFW (95+5)	Ada
GemsFDTD	0.37	0.55	1.04	0.42	0.29
astar.bi	0.57	0.91	1.65	0.58	0.64
astar.ri	0.69	0.95	1.97	0.66	0.79
bwaves	0.57	0.92	1.68	0.60	0.43
bzip2.chi	0.30	0.43	0.81	0.30	0.22
bzip2.com	1.00	1.52	2.71	1.01	0.72
bzip2.htm	0.30	0.43	0.92	0.34	0.31
bzip2.lib	0.30	0.42	0.89	0.30	0.21
bzip2.pro	1.01	1.60	3.19	0.98	0.68
bzip2.sou	0.95	1.49	2.92	1.08	0.96
cactusADM	0.41	0.60	1.35	0.47	0.32
calculix	0.35	0.60	1.12	0.36	0.26
dealII	0.33	0.51	1.10	0.40	1.20
gamess.cy	0.33	0.49	1.00	0.36	3.46
gamess.gra	0.35	0.51	1.06	0.38	1.09
gamess.tri	0.33	0.50	0.92	0.34	1.10
gcc.166	0.42	0.61	1.33	0.48	1.34
gcc.200	0.90	1.17	2.72	0.89	0.71
gcc.cpde	0.54	0.86	1.63	0.62	1.75
gcc.expr2	0.58	0.86	1.76	0.63	1.03
gcc.expr	0.63	0.89	1.75	0.61	0.70
gcc.g23	0.55	0.76	1.54	0.66	0.43
gcc.s04	0.57	0.93	1.66	0.67	0.69
gcc.scil	0.90	1.10	2.34	0.94	2.48
gcc.type	0.92	1.44	2.62	0.91	1.57
gobmk.13x	0.94	1.51	3.08	0.99	1.66
gobmk.nn	0.85	1.28	2.61	0.92	0.61
gobmk.sco	0.97	1.34	2.70	0.98	0.66
gobmk.tr	0.95	1.30	2.63	0.87	0.98
gobmk.tr	0.71	1.07	2.26	0.73	1.17
gromacs	0.72	1.00	2.25	0.72	0.48
h264ref.f	0.44	0.58	1.21	0.47	0.45
h264ref.s	0.38	0.50	1.04	0.38	2.23
hmmer.nph	0.77	1.25	2.52	0.85	1.45
hmmer.re	0.80	1.21	2.43	0.92	0.79
lbm	0.67	1.02	2.08	0.74	0.57
leslie3d	0.51	0.78	1.43	0.51	0.35
libquantum	0.56	0.78	1.55	0.98	0.39
mcf	3.14	4.18	9.35	3.34	2.32
milc	0.42	0.59	1.26	0.46	0.34
namd	0.52	0.77	1.38	0.48	0.31
omnetpp	1.08	1.66	3.19	1.27	1.06
perl.che	0.46	0.68	1.29	0.47	0.83
perl.di	0.55	0.83	1.37	0.52	1.56
perl.spli	0.43	0.66	1.31	0.43	0.32
povray	0.55	0.88	1.65	0.54	5.39
sjeng	0.72	1.05	2.00	0.67	2.14
soplex.p	1.15	1.59	3.57	1.36	0.87
soplex.r	1.11	1.70	3.05	1.14	0.71
sphinx3	0.46	0.72	1.33	0.59	1.49
tonto	0.37	0.55	1.19	0.41	0.48
xalancbmk	0.89	1.42	2.56	1.17	1.03
zeusmp	0.51	0.75	1.53	0.58	0.39
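Total	35.8	52.7	105.5	38.5	54.4
Note: mcf is the longest-running benchmark under 25M detailed warm-up (9.35 h), and povray is the longest-running benchmark under the Ada configuration (5.39 h).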

Table 8 Accuracy Comparison of Different Schemes (%)

Scheme	CPI	Branch MPKI	L1MP
Ada	99.6	91.6	95.1
0+50	99.8	98.9	97.5
0+25	99.7	94.1	91.3
0+10	99.1	85.2	82.8

Table 9 Branch MPKI Prediction Error Caused by WarmProfiler (Increase)

Benchmark	MPKI with perfect prediction	MPKI increase	MPKI increase (%)
gcc_expr2	0.443	0.177	39.9
gcc_g23	0.973	0.172	17.7
tonto	0.506	0.117	23.1
gamess_g	0.430	0.112	26.1
gcc_scilab	7.687	0.090	1.2
xalancbmk	2.003	0.079	3.9
gcc_s04	0.163	0.070	42.8
perl_di	0.669	0.066	9.8
h264ref_f	0.042	0.064	151.9
astar_rivers	3.422	0.053	1.6
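Note: The MPKI error is computed as the MPKI obtained with WarmProfiler-guided warm-up minus the MPKI obtained when warming up according to the real warm-up demand of the RTL. The benchmarks whose MPKI error exceeds 0.1 are gcc_expr2, gcc_g23, tonto, and gamess_g.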
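
The percentage column of Table 9 appears to be the MPKI increase divided by the MPKI obtained with perfect prediction. The sketch below recomputes it for a few rows under that assumption. Because the table lists rounded values, the recomputed percentages can differ slightly from the reported ones; the names used here are illustrative.

```python
from typing import Dict, Tuple

# (MPKI with perfect prediction, MPKI increase, reported increase in %), from Table 9.
TABLE9_ROWS: Dict[str, Tuple[float, float, float]] = {
    "gcc_expr2": (0.443, 0.177, 39.9),
    "gcc_g23": (0.973, 0.172, 17.7),
    "tonto": (0.506, 0.117, 23.1),
    "h264ref_f": (0.042, 0.064, 151.9),
}


def mpki_increase_percent(reference_mpki: float, increase: float) -> float:
    """Relative MPKI increase in percent (assumed definition)."""
    return increase / reference_mpki * 100.0


for name, (reference, increase, reported) in TABLE9_ROWS.items():
    computed = mpki_increase_percent(reference, increase)
    print(f"{name}: computed {computed:.1f}%, reported {reported:.1f}%")
```

The h264ref_f row shows why the absolute and relative errors should be read together: its absolute increase is small (0.064), but the baseline MPKI is so low that the relative increase exceeds 150%.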

Table 10 Impact of Cluster Count on Scheduling Balance

Scheduling balance	Random scheduling	LJF scheduling
4 clusters × 16 cores	0.93	0.99
8 clusters × 8 cores	0.76	0.98
16 clusters × 4 cores	0.54	0.63

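Table 11 Comparison of Simulation Time Between LJF Scheduling and Random Scheduling

Emulation	Random scheduling (h)	LJF scheduling (h)	Improvement (%)
Ada, 8 cores × 8 clusters	8.71	6.91	20.61
Ada, 8 cores × 16 clusters	6.25	5.38	13.89
25+5, 8 cores × 8 clusters	15.98	13.54	15.26
25+5, 8 cores × 16 clusters	11.29	9.35	17.23
Note: Ada combined with LJF scheduling is the scheme proposed by HyWarm; 25+5 combined with random scheduling is the baseline.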

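Table 12 Maximum Completion Time of LJF Scheduling Guided by Simulator IPC and Real IPC of RTL (h)

Ada emulation	Simulator-predicted IPC	Real IPC
8 cores × 4 clusters	13.77	13.67
8 cores × 8 clusters	6.91	6.92
8 cores × 16 clusters	5.38	5.38
Note: With 8 clusters, the simulator-predicted IPC yields a slightly shorter completion time (6.91 h versus 6.92 h). This is because LJF is a greedy algorithm, so an error in the predicted completion time can occasionally lead to a better schedule.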
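
Longest-job-first scheduling, as compared in Tables 11 and 12, can be sketched as a greedy assignment: sort jobs by estimated cost in descending order and always give the next job to the least-loaded cluster. The code below is a generic illustration of that rule, not the HyWarm implementation. The job names, the instruction counts, and the use of instructions divided by predicted IPC as the cost estimate are assumptions.

```python
import heapq
from typing import Dict, List, Tuple


def estimated_cycles(instructions: float, predicted_ipc: float) -> float:
    """Estimate a job's cost in cycles from its instruction count and predicted IPC."""
    return instructions / predicted_ipc


def ljf_schedule(job_costs: Dict[str, float], num_clusters: int) -> Tuple[float, List[List[str]]]:
    """Greedy longest-job-first assignment of jobs to identical clusters.

    Returns the maximum completion time (makespan) and the jobs per cluster.
    """
    if num_clusters < 1:
        raise ValueError("at least one cluster is required")
    # Min-heap of (accumulated load, cluster index).
    heap = [(0.0, index) for index in range(num_clusters)]
    heapq.heapify(heap)
    assignment: List[List[str]] = [[] for _ in range(num_clusters)]
    # Longest jobs first; each goes to the currently least-loaded cluster.
    for name, cost in sorted(job_costs.items(), key=lambda item: item[1], reverse=True):
        load, index = heapq.heappop(heap)
        assignment[index].append(name)
        heapq.heappush(heap, (load + cost, index))
    makespan = max(load for load, _ in heap)
    return makespan, assignment


# Hypothetical checkpoints: (instructions in millions, simulator-predicted IPC).
checkpoints = {"ckpt_a": (10.0, 2.0), "ckpt_b": (8.0, 2.0), "ckpt_c": (6.0, 2.0),
               "ckpt_d": (6.0, 2.0), "ckpt_e": (4.0, 2.0)}
costs = {name: estimated_cycles(insts, ipc) for name, (insts, ipc) in checkpoints.items()}
makespan, assignment = ljf_schedule(costs, num_clusters=2)
print(makespan, assignment)
```

Because the rule is greedy, the resulting schedule depends on the cost estimates, which is consistent with the note under Table 12: a small estimation error can change the job order and occasionally produce a shorter maximum completion time.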
