Acceleration Methods for Processor Microarchitecture Design Space Exploration: A Survey
-
摘要:
中央处理器是目前最重要的算力基础设施. 为了最大化收益,架构师在设计处理器微架构时需要权衡性能、功耗、面积等多个目标. 但处理器运行负载的指令多,单个微架构设计点的评估耗时从10 min到数十小时不等. 加之微架构设计空间巨大,全设计空间暴力搜索难以实现. 近些年来许多机器学习辅助的设计空间探索加速方法被提出,以减少需要探索的设计空间或加速设计点的评估,但缺少对加速方法的全面调研和系统分类的综述. 对处理器微架构设计空间探索的加速方法进行系统总结及分类,包含软件设计空间的负载选择、负载指令的部分模拟、设计点选择、模拟工具、性能模型5类加速方法. 对比了各加速方法内文献的异同,覆盖了从软件选择到硬件设计的完整探索流程. 最后对该领域的前沿研究方向进行了总结,并放眼于未来的发展趋势.
Abstract: The central processing unit is the most important computing infrastructure today. To maximize the return on a design, architects must trade off multiple objectives, including performance, power, and area, when designing a processor microarchitecture. However, because of the tremendous number of instructions in the workloads running on processors, evaluating a single microarchitecture design point takes from about ten minutes to tens of hours. Furthermore, the microarchitecture design space is huge, which makes a brute-force search of the full design space unrealistic. In recent years, many machine-learning-assisted acceleration methods for design space exploration have been proposed to reduce the design space that needs to be explored or to accelerate the evaluation of a single design point, but a comprehensive survey that summarizes and systematically classifies these acceleration methods is still missing. This survey systematically summarizes and classifies the acceleration methods for processor microarchitecture design space exploration into five categories: workload selection in the software design space, partial simulation of workload instructions, design point selection, simulation tools, and performance models. It compares the similarities and differences among the studies within each category and covers the complete exploration flow from software workload selection to hardware microarchitecture design. Finally, the frontier research directions in this field are summarized and future development trends are discussed.
-
近些年来,流数据在网络安全、智慧城市、气象预测等多个领域大量涌现. 流数据作为一种重要的数据类型,具有持续产生、实时性强、规模巨大且数据分布动态变化等复杂特性,这给流数据挖掘任务带来了极大挑战[1-5]. 概念漂移是指随着时间推移、数据分布发生变化,样本的输入特征和输出标签之间的关系也发生改变的现象[6-9]. 概念漂移发生后,集成模型若不能及时学习到新的数据分布特征,其性能将会下降.
基于集成学习的方法[10-12]利用历史数据构建基学习器,并借助特定的投票机制(如加权平均、组合投票等)进行集成决策,以此得到比单一基学习器更好的效果,解决了单一基学习器在流数据挖掘中不能把握全局信息的问题,因此利用集成学习处理概念漂移是一种有效可行的手段. 然而,传统集成学习方法在漂移发生后不能对新数据分布及时做出响应,且通常认为历史数据不再适用,如果这些数据中含有对当前模型学习有帮助的样本知识,直接丢弃则会造成已有资源的浪费. 此外,流数据分布变化方式的多样性易产生不同类型的概念漂移(如突变型和渐变型),不同类型漂移的数据分布变化跨度、变化快慢、变化方式等都不相同[13],然而多数在线集成模型只关注单一类型,不能针对漂移类型进行自适应建模.
为解决上述问题,本文提出一种面向不同类型概念漂移的两阶段自适应集成学习方法(two-stage adaptive ensemble learning method for different types of concept drift,TAEL). 该方法从解决不同类型的概念漂移问题入手,检测漂移跨度以确定漂移类型,并构建了针对类型的“过滤-扩充”两阶段样本处理机制. 一方面在样本过滤过程中,根据漂移类型创建非关键样本过滤器,过滤掉历史样本中的非关键因素,保证剩余的历史关键样本块的数据分布更加接近当前数据分布;另一方面在样本扩充过程中,根据漂移类型确定合适的抽样规模,由当前数据块中各个类别的规模占比设置历史关键样本的抽样优先级,并确定抽样概率,按照抽样概率进行分块优先抽样,以扩充当前样本块,为当前样本块补充样本特征的同时缓解了块内类分布不平衡. 本文工作的主要贡献有3方面:
1)通过检测漂移跨度确定概念漂移类型,为不同类型漂移的自适应集成建模提供了一种可行方案;
2)通过对历史数据中非关键样本的过滤,使更新后的历史数据分布更接近最新数据分布,提高了历史基学习器的有效性;
3)通过对当前数据的扩充,缓解了当前基学习器的欠拟合问题,提高了基学习器的稳定性.
1. 相关工作
目前,对含概念漂移的流数据挖掘的处理策略主要包括基于实例选择的方法和基于集成学习的方法. 基于实例选择的方法通常使用滑动窗口技术来实现,其基本思想是将数据流分成固定大小的窗口,通过窗口的向前滑动来实现对概念漂移的检测和处理. ADWIN[14]通过计算子窗口之间的均值差异来判断是否发生了概念漂移. DDM[15]通过持续监视窗口内的数据样本分类错误率来检测概念漂移. STEPD[16]通过比较最近窗口和整个窗口来检测错误率变化. DWCDS[17]提出一种双窗口机制来周期性地检测概念漂移,并对模型进行动态更新以适应概念漂移. CD-TW[18]首先创建2个分别加载历史数据和当前数据的基础节点时序窗口,通过比较二者包含数据的分布变化情况来检测概念漂移. CDT_MSW[19]由单个基本滑动窗口和单个基本静态窗口来检测概念漂移.
使用集成学习处理含概念漂移流数据的研究已经取得了很多成果和进展,基于集成学习的方法大体可分为2类:在线集成和基于数据块的集成.
在线集成是一种对样本进行逐一处理的增量学习方法. 基于单样本的增量模型方法[20]首先初始化一组基分类器,使用每个时间戳下到达的单个样本更新集成模型,然后对基分类器进行加权组合. DOED[21]通过维护低多样性和高多样性的在线加权集成,从而准确地处理各种类型的漂移. 基于混合标记策略的在线主动学习集成框架[22]由一个长期固定分类器和多个动态分类器组成来适应概念漂移. CBCE[23]为每个类维护一个基学习器,并在有新样本时更新基学习器. 在线集成学习方法能够有效提高模型的实时泛化性能,但由于需要逐一处理样本,增加了计算资源,易导致学习效率较低.
基于数据块的集成是一种对固定数量的输入实例进行处理的方法. SEA[24]在连续的数据块上构建基分类器,并且使用启发式替换策略组合成固定大小的集成模型. DWMIL[25],ACDWM[26]为每个数据块创建一个基学习器,通过根据基学习器在当前数据块上的分类性能进行动态加权集成. SRE[27]在基于块的框架中保留一部分先前少数样本以平衡当前块的类分布. DUE[28]为每个数据块创建若干候选分类器,对其进行分段加权,并通过动态调整分类器权重来解决概念漂移问题. SEOA[29]将神经网络的不同层次作为基分类器进行集成,根据各基分类器在当前数据块上的决策损失进行动态加权,以实现稳定性与适应性的平衡. 然而,划分的数据块的大小通常会影响模型的性能和训练速度,因此,选择合适的数据块大小很重要.
与传统方法相比,本文提出的TAEL方法能够充分利用新旧样本信息,根据漂移类型针对性地采用两阶段样本处理机制更新历史样本块和当前样本块,实现了集成模型在概念漂移发生后对新数据分布的快速响应.
2. 面向不同类型概念漂移的两阶段自适应集成
本文提出的TAEL方法的模型总体结构如图1所示. 在漂移类型检测阶段,通过检测漂移跨度span确定漂移类型. 在两阶段自适应集成阶段,首先根据漂移类型创建非关键样本过滤器F,过滤掉历史样本集D上的非关键样本,然后对剩余的历史关键样本 \hat D 进行分块优先抽样Sampling,根据漂移类型确定合适的抽样规模M,并根据样本所属类在当前样本块的规模占比设置抽样优先级α,由α获得抽样概率P,按照P抽取一定规模的关键样本子集 \widetilde D 来扩充当前数据集Dt. 在更新后的历史样本集和当前样本集中训练得到具有更高有效性的基学习器,提升了集成模型的实时泛化性能.
2.1 漂移类型检测
流数据是指实时、连续、无限、随时间不断变化的数据序列,时刻t到达的样本由具有联合概率分布 {P_t}({\boldsymbol {x}},y) 的数据源产生. 在流数据挖掘任务中,样本分布的不稳定和动态变化等因素导致流数据中隐含的目标概念发生改变,即概念漂移,其本质可看作流数据的联合概率分布发生变化:
{P_{t - 1}}({\boldsymbol{x}},y) \ne {P_t}({\boldsymbol{x}},y) . (1)
为了根据不同类型漂移有针对性地更新集成模型,首先在概念漂移位点处进行漂移类型检测. 本文通过计算span来检测漂移类型. span由漂移开始位点和漂移结束位点间相距的时间跨度确定. 本文判断漂移是否结束的依据是后序数据分布是否已经稳定. 已知漂移开始位点a,选取该位点后序的L个连续数据块 {D_{a + 1}},{D_{a + 2}},…,{D_{a + L}} ,在这些数据块上训练得到基学习器 {f_{a + 1}},{f_{a + 2}},…,{f_{a + L}} ,并得到在当前数据块上的实时预测精度 {acc_{a + 1}},{acc_{a + 2}},…,{acc_{a + L}} . 计算实时预测精度的方差:
{s^2} = \frac{\sum\limits_{l = 1}^L {(acc_{a + l} - \overline {acc})}^2}{L} , (2)
其中 \overline {acc} 为实时预测精度的平均值. {s^2} 反映了L个基学习器的预测差异,同时反映出位点 a 的后序数据分布的稳定程度. 若 {s^2} < \delta ( \delta 为漂移稳定性参数),则认为位点 a + 1 为漂移结束位点, span = 1 ;若 {s^2} \geqslant \delta ,则认为漂移仍未结束,接着从位点 a + 1 开始继续上述操作,直到得到漂移结束位点b, span = b - a ;若 span > \theta ( \theta 为漂移类型参数),则判定此次漂移为渐变型,否则判定此次漂移为突变型. 漂移类型检测过程如图2所示.
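为便于理解上述检测流程,下面给出漂移跨度span检测与漂移类型判定的一个简化示意实现(Python示例,非原文代码;其中acc_seq表示各位点基学习器在当前数据块上的实时预测精度序列,为本示例假设的输入接口):
```python
import numpy as np

def detect_drift_type(acc_seq, a, L, delta, theta):
    """按式(2)计算后序L个实时预测精度的方差, 检测漂移跨度span并判定漂移类型."""
    p = a                                   # 从漂移开始位点a逐位点检测
    while True:
        accs = np.asarray(acc_seq[p + 1 : p + 1 + L], dtype=float)
        # 式(2): 方差s2反映位点p的后序数据分布的稳定程度(数据不足时视为已稳定)
        s2 = np.mean((accs - accs.mean()) ** 2) if len(accs) == L else 0.0
        if s2 < delta:                      # 后序分布已稳定, 位点p+1即漂移结束位点b
            b = p + 1
            break
        p += 1                              # 漂移仍未结束, 从下一位点继续检测
    span = b - a
    drift_type = "gradual" if span > theta else "abrupt"   # 渐变型 / 突变型
    return span, drift_type
```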
2.2 “过滤-扩充”两阶段自适应集成学习
为充分利用当前漂移场景的样本信息和有选择地利用历史样本信息以提高集成模型在概念漂移发生后对新数据分布的适应性,本文提出“过滤-扩充”两阶段自适应集成学习方法. 过滤阶段通过过滤非关键样本以帮助历史样本块筛选接近当前数据分布的关键样本,扩充阶段通过向当前数据集补充过滤后保留的历史关键样本以弥补其缺少的样本特征.
2.2.1 样本过滤策略
由于历史样本中含有大量样本信息,而非关键样本会导致该数据集上模型的有效性降低. 为“筛掉”这些无用信息,提高样本质量,使历史数据分布更接近当前数据分布,本文提出一种样本过滤策略,通过创建非关键样本过滤器F过滤掉历史非关键样本. 考虑到不同类型的概念漂移场景下数据分布的变化方式和特点不同,因此需创建不同的F.
假设有历史数据块 D = \{ {D_1},{D_2}, … ,{D_{{n}}}\} ,第i个数据块为 {D_i} = \{ ({{\boldsymbol {x}}_{ij}},{y_{ij}})|j = 1,2, … ,k\} (k为数据块大小),由Di训练得到历史基学习器fi. 当前数据块 {D_t} = \{ ({{\boldsymbol {x}}_{tj}},{y_{tj}})|j = 1,2, … ,k\} ,在Dt上训练得到当前基学习器ft. 候选基学习器池Q用来存储参与集成的候选基学习器,最大容量s=15.
当发生突变型概念漂移时,数据分布急速变化,历史数据分布和当前数据分布差异较大,大量历史样本成为阻碍模型学习的负面因素,导致历史基学习器的性能快速下降. 由于当前基学习器ft在最新数据块Dt上训练得到,反映了流数据的最新分布,因此,为了快速过滤掉历史非关键样本,本文针对这种类型的概念漂移采用一种直接式过滤器,将ft作为每个历史数据块的非关键样本过滤器F,即 F = {f_t} . 以ft对Di的预测观察结果作为样本过滤条件Ci,表达式为:
{C_i}:{y_{ij}} \ne F({\boldsymbol{x}}_{ij}) , (3)
真实标签与ft预测结果不同的样本将被直接过滤掉.
当发生渐变型概念漂移时,数据分布变化较缓慢,历史数据分布与当前数据分布虽有差异但仍相似,历史数据块中可能只有少量样本变得非关键,因此与突变型概念漂移的直接过滤方式不同,渐变型概念漂移采用一种叠加式过滤器,即通过历史数据块的后序基学习器和ft的加权组合来叠加过滤效果,确保充分利用历史样本知识和当前样本知识帮助进行更加准确的过滤操作. 为了实现对样本知识的有效利用,首先需要区分每个基学习器的重要程度,本文将基学习器在Dt上的实时预测精度作为其权重. 在此基础上,Di的叠加过滤器Fi为:
{F_i} = \sum\limits_{p = i + 1}^n {\frac{w_p}{\sum\limits_{q = i + 1}^n {w_q} + {w_t}}} {f_p} + \frac{w_t}{\sum\limits_{q = i + 1}^n {w_q} + {w_t}}{f_t} , (4)
{w_g} = \frac{1}{k}\sum\limits_{j = 1}^k { \llbracket {f_g}({{\boldsymbol {x}}_{tj}}) = {y_{tj}} \rrbracket} , \quad g = 1,2,…,n, (5)
{w_t} = \frac{1}{k}\sum\limits_{j = 1}^k { \llbracket {f_t}({{\boldsymbol {x}}_{tj}}) = {y_{tj}} \rrbracket} , (6)
其中当 \llbracket \cdot \rrbracket 中的条件成立时值为1,否则为0. 以Fi对历史样本的预测观察结果作为样本过滤条件,表达式为:
{C_i}:{y_{ij}} \ne {F_i}({{\boldsymbol {x}}_{ij}}) , (7)
真实标签与Fi预测结果不同的样本将被过滤掉.
经过上述操作,符合过滤条件的样本被丢弃,剩下更符合当前数据分布的历史关键样本块 {\hat D_1},{\hat D_2}, … ,{\hat D_{{n}}} . 由于在突变型概念漂移发生后,过滤的样本通常较多,训练样本不足易导致模型训练不充分,因此本文向过滤后的每个历史样本块中补充Dt. 最后,在更新后的历史关键样本块上训练得到 {\hat f_1},{\hat f_2}, … ,{\hat f_{{n}}} ,提高了基学习器的有效性.
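下面给出样本过滤阶段的一个简化示意实现(非原文代码,仅用于说明式(3)~(7)的过滤逻辑;假设各基学习器提供sklearn风格的predict接口,式(4)的加权组合在此以加权投票方式近似实现):
```python
import numpy as np

def filter_noncritical(hist_blocks, hist_learners, f_t, D_t, drift_type):
    """按式(3)~(7)过滤历史数据块中的非关键样本, 返回历史关键样本块列表."""
    X_t, y_t = D_t
    # 式(5)(6): 以各基学习器在当前数据块Dt上的实时预测精度作为权重
    w = [np.mean(f.predict(X_t) == y_t) for f in hist_learners]
    w_t = np.mean(f_t.predict(X_t) == y_t)

    filtered = []
    for i, (X_i, y_i) in enumerate(hist_blocks):
        if drift_type == "abrupt":
            # 突变型: 直接式过滤器F=f_t, 过滤条件见式(3)
            pred = f_t.predict(X_i)
        else:
            # 渐变型: 叠加式过滤器, 由后序基学习器与f_t按式(4)加权组合(此处用加权投票近似)
            members = list(hist_learners[i + 1:]) + [f_t]
            weights = np.asarray(w[i + 1:] + [w_t])
            weights = weights / weights.sum()
            votes = np.stack([m.predict(X_i) for m in members])        # 形状(m, k)
            classes = np.unique(np.concatenate([y_i, y_t]))
            scores = np.stack([weights @ (votes == c) for c in classes])
            pred = classes[scores.argmax(axis=0)]
        keep = pred == y_i          # 式(3)(7): 与过滤器预测不一致的样本被过滤掉
        filtered.append((X_i[keep], y_i[keep]))
    return filtered
```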
2.2.2 样本扩充策略
概念漂移发生后,当前基学习器往往欠拟合,而历史样本恰恰可以帮助当前样本集弥补其缺少的样本知识. 因此,本文提出一种样本扩充策略,将过滤后保留的历史关键样本块 {\hat D_1},{\hat D_2}, … ,{\hat D_{{n}}} 用来扩充Dt. 然而,即使历史样本集已过滤掉部分样本,全部扩充到Dt所花费的时间代价仍较大,为解决这个问题,本文从各个历史数据块中抽取子集 {\widetilde D_1},{\widetilde D_2}, … ,{\widetilde D_n} 用来扩充Dt. 由于扩充后的Dt可能存在类别不平衡,造成这种情况的原因有2种:一种原因是Dt本身就存在类别不平衡的问题,而抽取的样本子集没有改善甚至加重了这种不平衡;另一种原因是Dt本身类分布平衡,但扩充导致了类别不平衡. 因此本文从抽取方式入手,为了降低扩充后的Dt的类不平衡率,提出一种分块优先抽样方法,该方法根据样本所属类在Dt中总类别的规模占比确定抽样优先级α,由此计算得到抽样概率P,按照抽样概率P依次从各个历史关键样本块中不放回地抽取一定数量的关键样本子集用于扩充.
抽样规模的设置直接关系实验结果的好坏. 如果抽样规模太小,将会导致抽样样本不足以提供足够的关键信息;如果抽样规模太大,将会浪费时间和资源,从而降低效率. 由于突变型漂移前后数据分布的差异较大,历史关键样本往往较少,设置总抽样规模M为较小值;渐变型漂移前后数据分布间虽有差异但仍相似,历史关键样本往往较多,设置M为较大值. 因此,可将总抽样规模M和漂移跨度span联系起来,表达式为:
M = \lambda \times \frac{span}{span + 1} \times \sum\limits_{i = 1}^n {z_i} , (8)
其中λ为样本规模因子,zi为历史数据块 {\hat D_i} 的大小. 在确定M后, {\hat D_i} 的抽样规模Mi由其大小确定,同时为了保证有相对足够的采样样本,限制最小的块抽样规模,表达式为:
{M_i} = \max \left\{ \lambda \times \frac{span}{span + 1} \times {z_i} , \frac{1}{10n}\sum\limits_{j = 1}^n {z_j} \right\} . (9)
为了缓解Dt在扩充后的类别不平衡现象,每个样本被抽中的概率与其所属类在Dt中的规模占比密切相关,即越少的类被选中的概率越大,越多的类被选中的概率越小. 因此,为历史样本中类别规模占比较小的样本设置较高的优先级,为类别规模占比较大的样本设置较低的优先级. 如果判断xij所属类别为 {c'} ,设置其抽样优先级为
{\alpha _{ij}} = \left\{ \begin{gathered} \ln \left(\frac{\sum\limits_{c \in C} {\sum\limits_{x = 1}^k { \llbracket {y_{tx}} = c \rrbracket} }}{\sum\limits_{x = 1}^k { \llbracket {y_{tx}} = {c'} \rrbracket} }\right), \quad {c'} \in C {\text{ 且 }} \left| C \right| > 1, \\ \ln \left(\frac{\sum\limits_{c \in C} {\sum\limits_{x = 1}^k { \llbracket {y_{tx}} = c \rrbracket} }}{2}\right), \quad {c'} \in C {\text{ 且 }} \left| C \right| = 1, \\ \ln \left(\sum\limits_{c \in C} {\sum\limits_{x = 1}^k { \llbracket {y_{tx}} = c \rrbracket} } \right), \quad {\text{其他}}, \\ \end{gathered} \right. (10)
其中C为当前样本块中出现的样本类别. 抽样优先级和抽样概率成正比,xij的抽样概率可表示为
{P_{ij}} = \Pr (({{\boldsymbol x}_{ij}},{y_{ij}}) \in {\widetilde D_i}|({{\boldsymbol x}_{ij}},{y_{ij}}) \in {\hat D_i}) = \frac{{\alpha _{ij}}}{\sum\limits_{p = 1}^{z_i} {\alpha _{ip}}} . (11)
显然,当 {\hat D_i} 中每个样本的抽样优先级相等时,有
{P_{ij}} = \Pr (({{\boldsymbol x}_{ij}},{y_{ij}}) \in {\widetilde D_i}|({{\boldsymbol x}_{ij}},{y_{ij}}) \in {\hat D_i}) = \frac{1}{{z_i}} , (12)
分块优先抽样过程变为简单随机抽样. 将历史数据块 {\hat D_i} 的优先抽样函数表示为
{\widetilde D_i} = Sampling({\hat D_i},{M_i},{P_i}) . (13)
依次从 {\hat D_1},{\hat D_2},…,{\hat D_n} 中抽取数量为 {M_1},{M_2},…,{M_n} 的关键样本子集 {\widetilde D_1},{\widetilde D_2},…,{\widetilde D_n} ,将关键样本子集扩充到Dt中,得到扩充后的 {\hat D_t} = {\widetilde D_1} \cup {\widetilde D_2} \cup … \cup {\widetilde D_n} \cup {D_t} . 经过上述操作,向 {\hat D_t} 中补充了历史有用信息并且使类分布更加均衡,在扩充后的 {\hat D_t} 上训练得到的 {\hat f_t} 具有更丰富的样本特征,解决了当前基学习器的欠拟合问题,同时提高了基学习器的稳定性. 突变型和渐变型场景下的两阶段自适应集成过程如图3所示.
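下面给出样本扩充阶段(式(8)~(13))的一个简化示意实现(非原文代码;filtered_blocks为过滤后的历史关键样本块,函数返回扩充后的当前数据块 \hat D_t ):
```python
import numpy as np

def priority_sampling(filtered_blocks, D_t, span, lam, rng=None):
    """按式(8)~(13)进行分块优先抽样, 并用抽取的关键样本子集扩充当前数据块Dt."""
    rng = rng or np.random.default_rng(0)
    X_t, y_t = D_t
    n = len(filtered_blocks)
    sizes = np.array([len(y_i) for _, y_i in filtered_blocks])
    ratio = lam * span / (span + 1)                        # λ×span/(span+1)
    # 式(9): 各历史块的抽样规模, 并限制最小块抽样规模
    M_i = np.maximum(ratio * sizes, sizes.sum() / (10 * n)).astype(int)

    classes, counts = np.unique(y_t, return_counts=True)   # Dt中各类别的规模
    total = counts.sum()
    count_of = dict(zip(classes.tolist(), counts.tolist()))

    X_new, y_new = [X_t], [y_t]
    for (X_i, y_i), m in zip(filtered_blocks, M_i):
        if len(y_i) == 0:
            continue
        # 式(10): 所属类别在Dt中规模占比越小, 抽样优先级alpha越高
        alpha = np.array([
            np.log(total / count_of[c]) if (c in count_of and len(classes) > 1)
            else (np.log(total / 2.0) if c in count_of else np.log(total))
            for c in y_i
        ])
        p = alpha / alpha.sum()                             # 式(11): 抽样概率
        m = min(int(m), len(y_i))
        idx = rng.choice(len(y_i), size=m, replace=False, p=p)   # 不放回优先抽样
        X_new.append(X_i[idx])
        y_new.append(y_i[idx])
    return np.vstack(X_new), np.concatenate(y_new)          # 扩充后的当前数据块
```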
在将 {\hat f_t} 存储到Q前,需要判断Q是否达到最大容量s. 如果 n \geqslant s ,那么用 {\hat f_t} 替换掉在Dt上实时预测精度最小的历史基学习器:
{\hat f_t} \to \mathop{\arg\max }\limits_{{\hat f}_i \in Q} \sum\limits_{j = 1}^k { \llbracket {{\hat f}_i}({{\boldsymbol x}_{tj}}) \ne {y_{tj}} \rrbracket} . (14)
最终的强分类器H对于x的预测结果为多数更新后的基学习器预测的结果,即
H({\boldsymbol{x}}) = \mathop {\arg \max }\limits_{y} \sum\limits_{i = 1}^n { \llbracket {f_i}({\boldsymbol x}) = y \rrbracket} . (15)
2.3 算法实施流程
TAEL方法首先检测漂移跨度span,判断漂移类型,然后在过滤阶段,设置非关键样本过滤器F,依次对历史样本块进行过滤操作,将剩余的历史关键样本用于训练更新历史基学习器,以提高其有效性;在扩充阶段,采用分块优先抽取策略,根据样本所属类别的规模占比设置抽样优先级,计算得到抽样概率,从历史关键样本块中抽取合适数量的样本子集来扩充当前样本块,缓解了扩充后的类分布不均衡,解决了当前基学习器欠拟合的问题. 算法1展示了TAEL方法的执行流程.
算法1. 面向不同类型概念漂移的两阶段自适应集成算法.
输入:历史数据块 {D_1},{D_2}, … ,{D_{{n}}} ,当前数据块Dt,漂移跨度span,历史基学习器 {f_1},{f_2}, … ,{f_{{n}}} ,当前基学习器ft,非关键样本过滤器F.
输出:更新后的基学习器 \hat f_1,\hat f_2,…,\hat f_n 和 \hat f_t .
① 获取Dt上每个类别的样本数 \displaystyle\sum\limits_{x = 1}^k { \llbracket {{y_{tx}} = c} \rrbracket } ;
② if span \leqslant \theta
③ {F_i} = {f_t} ;
④ else
⑤ {F_i} = \displaystyle\sum\limits_{p = i + 1}^n {\frac{{{w_p}}}{{\displaystyle\sum\limits_{q = i + 1}^n {{w_q} + {w_t}} }}{f_p}} + \frac{{{w_t}}}{{\displaystyle\sum\limits_{q = i + 1}^n {{w_q} + {w_t}} }}{f_t} ;
⑥ end if
⑦ for i = 1:n
⑧ for j = 1:k
⑨ if {F_i}({{\boldsymbol {x}}_{ij}}) \ne {y_{ij}}
⑩ 从Di中删除样本xij;
⑪ else
⑫ 根据式(10)计算xij的抽样优先级αij;
⑬ end if
⑭ end for
⑮ 根据式(11)由抽样优先级αi计算抽样概率Pi;
⑯ 得到过滤后的历史关键数据块 {\hat D_i} ;
⑰ if span \leqslant \theta
⑱ 更新历史基学习器 {\hat f_i} \leftarrow train({\hat D_i} \cup {D_t}) ;
⑲ else
⑳ 更新历史基学习器 {\hat f_i} \leftarrow train({\hat D_i}) ;
㉑ end if
㉒ end for
㉓ 获取总抽样规模 M = \lambda \times \dfrac{{span}}{{span + 1}} \times \displaystyle\sum\limits_{i = 1}^n {{z_i}} ;
㉔ for i = 1:n
㉕ 根据式(9)计算每个 {\hat D_i} 上的抽样规模Mi;
㉖ 按照抽样概率Pi从 {\hat D_i} 中抽取大小为Mi的 {\widetilde D_i} = Sampling({\hat D_i},{M_i},{P_i}) ;
㉗ end for
㉘ {\hat D_t} = {D_t} \cup {\widetilde D_1} \cup {\widetilde D_2} \cup … \cup {\widetilde D_n} ;
㉙ 更新当前基学习器 {\hat f_t} \leftarrow train({\hat D_t}) ;
㉚ 根据式(15)将最新更新的基学习器参与集成;
㉛ 在Dt+1上进行测试,得到实时精度.
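结合算法1的步骤㉚㉛,下面给出集成更新与预测环节(式(14)(15))的一个简化示意实现(非原文代码;pool对应候选基学习器池Q,假设基学习器提供predict接口):
```python
import numpy as np

def update_pool(pool, f_new, D_t, s=15):
    """若候选池Q已达最大容量s, 按式(14)用f_new替换在Dt上实时预测精度最低(错误数最多)的基学习器."""
    X_t, y_t = D_t
    if len(pool) >= s:
        errors = [np.sum(f.predict(X_t) != y_t) for f in pool]
        pool[int(np.argmax(errors))] = f_new
    else:
        pool.append(f_new)
    return pool

def ensemble_predict(pool, X):
    """式(15): 强分类器H对x的预测为多数基学习器投票的结果."""
    votes = np.stack([f.predict(X) for f in pool])                  # 形状(n, k)
    classes = np.unique(votes)
    counts = np.stack([(votes == c).sum(axis=0) for c in classes])  # 各类别得票数
    return classes[counts.argmax(axis=0)]
```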
2.4 模型复杂度分析
TAEL的计算成本主要集中在漂移类型检测、样本过滤、样本扩充和基学习器更新这4个阶段,本文将依次对每个阶段进行时间复杂度分析.
1)漂移类型检测. 预测一个数据块中样本的时间复杂度为O(k),其中k为数据块的样本数,那么L个历史基学习器在当前数据块上的预测时间复杂度为O(Lk),计算后序数据分布稳定程度的时间复杂度为O(L). 因此,漂移类型检测过程的时间复杂度为 O(Lk) + O(L) = O(Lk) .
2)样本过滤. 训练一个SVM分类器的时间复杂度为O(p^2),其中p为样本数,因此,在大小为k的数据块上训练基学习器的时间复杂度为O(k^2). 当发生突变型概念漂移时,使用直接式过滤器,当前基学习器对所有历史数据的预测时间复杂度为O(sk),其中s为基学习器的最大存储容量. 该过程的时间复杂度为O(k^2)(一般地, s < k ).
当发生渐变型概念漂移时,采用叠加式过滤器,根据历史和最新基学习器在当前数据块上的预测结果得到权值的时间复杂度为O((s+1)k). 对所有历史数据块依次执行基学习器的加权预测,总的时间复杂度为O(s(s+1)k). 因此,样本过滤过程的时间复杂度为 O({k^2}) + O((s + 1)k) + O(s(s + 1){{k}}) = O({s^2}k) .
3)样本扩充. 计算所有历史数据块的抽样规模Mi的时间复杂度为O(s). 计算每个历史数据块中样本的抽样优先级α的时间复杂度为O(sk),计算每个样本的抽样概率P的时间复杂度为O(sk). 对于每个历史数据块,根据抽样概率P在当前数据块随机抽取Mi个样本的时间复杂度为O(sk). 该过程的时间复杂度为 O(s) + 3O(sk) = O(sk) .
4)更新基学习器. 训练s+1个基学习器的时间复杂度为O((s+1)k^2). 替换掉最差基学习器的时间复杂度为O((s+1)k),整个过程的时间复杂度为 O((s + 1)({k^2} + k)) = O(s{k^2}) .
3. 实验分析
为验证本文提出的TAEL方法的有效性,本文在具有不同类型概念漂移的标准数据集和真实数据集上进行实验,并从精度、鲁棒性以及收敛性这3个方面进行评价. 实验平台为Windows 10操作系统,CPU为酷睿i7(主频3.2 GHz),内存为8 GB,本方法采用MATLAB R2018a编写和运行.
3.1 实验数据
为了检验方法对不同类型概念漂移的处理能力,本文使用大规模在线分析平台MOA[30]中的流数据生成器产生了6个具有突变式、渐进式以及增量式的概念漂移数据集. 除此之外,本文还选取了4个真实数据集. 具体的数据集信息如表1所示.
表 1 数据集信息
Table 1. Datasets Information
分类 数据集 实例数 维度 类别数量 漂移类型 漂移数量 漂移位点
合成数据集 Sea 100×10³ 3 2 渐进式 3 25×10³, 50×10³, 75×10³
合成数据集 Hyperplane 100×10³ 10 2 增量式 - -
合成数据集 RBFBlips 100×10³ 20 4 突变式 3 25×10³, 50×10³, 75×10³
合成数据集 LED_abrupt 100×10³ 24 10 突变式 1 50×10³
合成数据集 LED_gradual 100×10³ 24 10 渐进式 3 25×10³, 50×10³, 75×10³
合成数据集 Tree 100×10³ 30 10 突变式 3 25×10³, 50×10³, 75×10³
真实数据集 Electricity 45.3×10³ 6 2 - - -
真实数据集 Kddcup99 494×10³ 41 23 - - -
真实数据集 Covertype 581×10³ 54 7 - - -
真实数据集 Weather 95.1×10³ 9 3 - - -
注:“-”表示未知.
3.2 评价指标
为衡量TAEL方法的性能,本节从模型的精度、鲁棒性及收敛性3方面进行了分析.
1) 平均实时精度(average real-time accuracy,Avgracc)表示模型在每个时间步的实时精度的平均值,反映模型的实时性能.
Avgracc = \frac{1}{T}\sum\limits_{t = 1}^T {\frac{{n_t}}{|{D_t}|}} , (16)
其中nt代表时间步t内正确分类的样本数,|Dt|表示样本块大小,T表示总的时间步数. 平均实时精度越高说明模型分类性能越好.
2)累积精度(cumulative accuracy,Cumacc)表示模型在当前时刻的累积预测正确样本数和总样本数的比值,反映模型从开始到当前时刻的整体性能.
Cumacc = \frac{\sum\limits_{i = 1}^{T_t} {n_i}}{\sum\limits_{j = 1}^{T_t} {|{D_j}|}} , (17)
其中Tt表示当前累积的时间步数.
3)鲁棒性(robustness,R)[31]表示模型的稳定性和泛化性能. 本文在平均实时精度上分析了不同方法的鲁棒性,定义为:
R(Dataset) = \frac{racc(Dataset)}{\min racc(Dataset)} , (18)
其中 racc(Dataset) 表示某算法在数据集Dataset上的平均实时精度, \min racc(Dataset) 表示在数据集Dataset上所有算法中的最小平均实时精度.
某算法的整体鲁棒性值为该算法在所有数据集上的鲁棒性的总和. 鲁棒性值越大说明算法越稳定,面对数据中存在的干扰也能保持较好的性能.
4)收敛速度(recovery speed under accuracy,RSA)表示模型从概念漂移位点起实时精度恢复到稳定所需要的时间步数step与收敛位点后K个位点平均错误率avge的乘积:
RSA = step \times avge . (19)
如果一个位点的性能表现和其后续K个参照位点的平均性能表现的差异小于阈值 \gamma (当前波动程度较小),同时K个参照位点的前半部分和后半部分的平均性能表现的差异小于 \dfrac{\gamma }{2} (整体波动程度趋近于稳定),那么该位点为收敛位点:
\begin{gathered} \left| {acc_t} - \frac{\sum\limits_{j = 1}^K {acc_{t + j}}}{K} \right| < \gamma {\text{ 且}} \\ \frac{2}{K}\left|\sum\limits_{j = 1}^{\tfrac{K}{2}} {acc_{t + j}} - \sum\limits_{k = \tfrac{K}{2} + 1}^K {acc_{t + k}} \right| < \frac{\gamma }{2} . \\ \end{gathered} (20)
3.3 参数设置
本节对实验模型中的相关参数进行4点讨论:
1)数据块大小k. 过大的数据块中可能包含概念漂移,从而影响模型的分类效果;过小的数据块中可能无法包含足够多的样本特征,从而导致训练的基学习器稳定性较差. 因此,本文统一设置 k = 500 .
2)漂移稳定性参数 \delta 和漂移类型参数 \theta . 考虑到流数据本身的复杂性以及概念漂移类型的多样性,本文设置 \delta = 0.01 , \theta = 1 .
3)样本规模因子 \lambda . 样本规模控制了整体抽样的数量,直接影响了当前基学习器的训练,从而可能会对整体的模型性能造成影响. 因此,本文选取 \lambda \in \{ 0.2,0.4,0.6,0.8\} 进行讨论,得到了在不同 \lambda 下的分类性能,并使用最优样本规模因子与对比方法进行比较.
4)基学习器f. 本文选择LIBSVM来构建“同质”基学习器,核参数采用默认值 g = 1/v (v为数据特征维度),惩罚因子设置为 C = 10,构建方式的示意见下.
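作为参考,下面给出按上述设置构建单个基学习器的一种示意写法(假设使用scikit-learn中基于LIBSVM实现的SVC,参数名称以该库为准,非原文代码):
```python
from sklearn.svm import SVC   # scikit-learn的SVC底层基于LIBSVM实现

def make_base_learner(n_features, C=10.0):
    """构建“同质”SVM基学习器: 核参数g取默认值1/v(v为特征维度), 惩罚因子C=10."""
    return SVC(kernel="rbf", gamma=1.0 / n_features, C=C)
```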
3.4 实验结果与分析
为评估TAEL的性能,本文选取DWCDS[17],HBP[32],Resnet[33],Highway[34]以及原始深度神经网络(DNN)在精度、鲁棒性和收敛性3个方面进行对比实验和结果分析.
3.4.1 模型精度结果和分析
本节首先分析了在不同样本规模因子 \lambda 下集成模型的表现性能. 表2展示了TAEL方法在不同 \lambda 下的平均实时精度. 从表2可以看出当 \lambda = 0.4 和 \lambda = 0.6 时的平均实时精度值较高,这也反映了 \lambda 会在一定程度上影响当前基学习器的性能,进而影响整个集成模型的实时精度. 分析其原因可能是当 \lambda 取值较大时扩充的历史样本数太多,此时的关键信息冗余,训练得到的基学习器效果较差;当 \lambda 取值较小时扩充的样本数太少,可能丢弃潜在的可用数据,导致训练得到的基学习器处于欠拟合状态. 因此,本文选择适中的扩充规模,又因实验结果中 \lambda = 0.4 时的平均实时精度大于 \lambda = 0.6 时的平均实时精度,最终选择 \lambda = 0.4 的情况下与其他方法进行对比分析.
表 2 不同λ下平均实时精度
Table 2. Average Real-Time Accuracy Under Different λ
数据集 平均实时精度(排名):λ=0.2 λ=0.4 λ=0.6 λ=0.8
Sea 0.8389 (1) 0.8378 (3) 0.8378 (3) 0.8378 (3)
Hyperplane 0.9109 (4) 0.9110 (2.5) 0.9111 (1) 0.9110 (2.5)
RBFBlips 0.9549 (2.5) 0.9549 (2.5) 0.9549 (2.5) 0.9549 (2.5)
LED_abrupt 0.6228 (3) 0.6229 (2) 0.6229 (2) 0.6229 (2)
LED_gradual 0.6205 (3) 0.6226 (1) 0.6199 (4) 0.6213 (2)
Tree 0.6671 (1) 0.6669 (2) 0.6660 (3) 0.6656 (4)
Electricity 0.7205 (2) 0.7193 (3) 0.7211 (1) 0.7190 (4)
Kddcup99 0.9384 (4) 0.9449 (2) 0.9455 (1) 0.9448 (3)
Covertype 0.7520 (2) 0.7526 (1) 0.7517 (3.5) 0.7517 (3.5)
Weather 0.8969 (3.5) 0.8970 (1.5) 0.8970 (1.5) 0.8969 (3.5)
平均排名 2.6 2.05 2.25 3.0
注:黑体数字表示最高平均实时精度及其排名.
表3展示了不同方法在所有数据集上的平均实时精度及其综合排名. 由表3看出,在合成数据集上,TAEL的实时精度最好;在真实数据集上,TAEL的实时精度排名也都位于前列. TAEL在真实数据集上排名略低的原因可能在于数据集中概念漂移的出现较为密集,而TAEL利用数据块进行处理的方式可能会漏检,导致无法对基学习器进行及时地更新,从而使整个集成模型的性能下降. 在整体排名上TAEL的排名最高,说明了该方法能够提高集成模型的有效性,有较好处理不同类型概念漂移的能力.
表 3 不同方法在各数据集上的平均实时精度
Table 3. Average Real-Time Accuracy of Different Methods on Each Dataset
数据集 平均实时精度(排名):DWCDS DNN-2 DNN-4 DNN-8 DNN-16 HBP Highway Resnet TAEL
Sea 0.7499 (4) 0.7081 (9) 0.7155 (8) 0.7495 (5) 0.7441 (7) 0.7771 (2) 0.7684 (3) 0.7448 (6) 0.8378 (1)
Hyperplane 0.6812 (9) 0.8600 (5) 0.8578 (6) 0.8487 (7) 0.7227 (8) 0.8692 (3) 0.8841 (2) 0.8637 (4) 0.9110 (1)
RBFBlips 0.8214 (8) 0.8256 (7) 0.8716 (2) 0.8655 (3) 0.4718 (9) 0.8350 (5) 0.8482 (4) 0.8300 (6) 0.9549 (1)
LED_abrupt 0.3700 (8) 0.5868 (3) 0.5809 (4) 0.5311 (7) 0.2784 (9) 0.5692 (6) 0.5893 (2) 0.5796 (5) 0.6229 (1)
LED_gradual 0.3804 (8) 0.5773 (4) 0.5898 (2) 0.5350 (7) 0.3031 (9) 0.5650 (6) 0.5839 (3) 0.5700 (5) 0.6199 (1)
Tree 0.5558 (2) 0.1948 (6) 0.2057 (3) 0.1338 (8) 0.1141 (9) 0.1432 (7) 0.2036 (4) 0.1992 (5) 0.6669 (1)
Electricity 0.7346 (1) 0.6228 (6) 0.6231 (5) 0.5635 (8) 0.5154 (9) 0.5676 (7) 0.6317 (4) 0.6343 (3) 0.7193 (2)
Kddcup99 0.9829 (1) 0.8796 (3) 0.7186 (6) 0.4763 (8) 0.3017 (9) 0.7670 (4) 0.7537 (5) 0.6535 (7) 0.9449 (2)
Covertype 0.8486 (1) 0.5251 (9) 0.5739 (8) 0.6243 (6) 0.6269 (5) 0.6465 (3) 0.6354 (4) 0.6183 (7) 0.7526 (2)
Weather 0.9566 (1) 0.8478 (3) 0.8050 (6) 0.8057 (5) 0.8043 (7) 0.8139 (4) 0.7813 (9) 0.8034 (8) 0.8970 (2)
平均排名 4.30 5.50 5.00 6.40 8.10 4.70 4.00 5.60 1.40
注:黑体数字表示最高平均实时精度及其排名.
图4为TAEL和各个对比方法在所有数据集上的累积精度,表4为TAEL和各个对比方法的最终累积精度和综合排名. 由图4和表4可知,在标准数据集上TAEL的累积精度最高,在真实数据集上TAEL的累积精度也有较好的排名,分析其原因是该方法针对漂移类型对数据块逐一处理的策略能够使模型对不同类型的概念漂移做出及时响应,保持较高的精度.
表 4 不同方法在各数据集上的最终累积精度
Table 4. Final Cumulative Accuracy of Different Methods on Each Dataset
数据集 最终累积精度(排名):DWCDS DNN-2 DNN-4 DNN-8 DNN-16 HBP Highway Resnet TAEL
Sea 0.7500(8) 0.7495(9) 0.7543(7) 0.7861(4) 0.7820(5) 0.8083(2) 0.7977(3) 0.7803(6) 0.8370(1)
Hyperplane 0.6763(9) 0.8600(5) 0.8580(6) 0.8483(7) 0.7230(8) 0.8691(3) 0.8840(2) 0.8636(4) 0.9110(1)
RBFBlips 0.8231(8) 0.8345(7) 0.8828(2) 0.8708(3) 0.5379(9) 0.8476(5) 0.8586(4) 0.8374(6) 0.9481(1)
LED_abrupt 0.3681(8) 0.5869(3) 0.5803(4) 0.5305(7) 0.2786(9) 0.5693(6) 0.5893(2) 0.5796(5) 0.6229(1)
LED_gradual 0.3821(8) 0.5776(4) 0.5898(2) 0.5344(7) 0.3032(9) 0.5650(6) 0.5843(3) 0.5699(5) 0.6199(1)
Tree 0.5558(2) 0.4329(5) 0.4575(3) 0.3330(8) 0.3033(9) 0.3591(7) 0.4472(4) 0.4310(6) 0.6636(1)
Electricity 0.7404(1) 0.6434(6) 0.6450(4) 0.5840(8) 0.5735(9) 0.5969(7) 0.6447(5) 0.6502(3) 0.6674(2)
Kddcup99 0.9833(1) 0.9832(2) 0.9195(7) 0.7813(8) 0.6160(9) 0.9823(3) 0.9614(4) 0.9276(6) 0.9562(5)
Covertype 0.8481(1) 0.6983(9) 0.7336(8) 0.7676(6) 0.7685(5) 0.7919(2) 0.7823(3) 0.7709(4) 0.7463(7)
Weather 0.9571(1) 0.8872(3) 0.8743(6) 0.8754(5) 0.8664(8) 0.8824(4) 0.8362(9) 0.8708(7) 0.8933(2)
平均排名 4.70 5.30 4.90 6.30 8.00 4.50 3.90 5.20 2.20
注:黑体数字表示最高的最终累积精度及其排名.
本文使用非参数检验方法Friedman-Test[35]对TAEL与对比方法相比较的性能优势进行统计检验. 对于给定的K(K=9)种方法和N(N=10)个数据集,令 r_i^j 为第j个方法在第i个数据集上的秩,则第j个算法的秩和平均为
{R_j} = \frac{1}{N}\sum\limits_{i = 1}^N {r_i^j} . (21)
零假设H0假定所有方法的性能是相同的. 在此前提下,当N与 K 足够大时,Friedman统计值 {\tau _F} 服从第一自由度为 K - 1 、第二自由度为 (K - 1)(N - 1) 的F分布:
{\tau _F} = \frac{(N - 1){\tau _{\chi ^2}}}{N(K - 1) - {\tau _{\chi ^2}}} , (22)
{\tau _{\chi ^2}} = \frac{12N}{K(K + 1)}\left[\sum\limits_{j = 1}^K {R_j^2} - \frac{K{(K + 1)}^2}{4}\right] .
若计算得到的统计值大于某一显著性水平下F分布临界值,则拒绝零假设H0,表明各方法的秩和存在显著差异,即测试方法性能存在显著差异;反之则接受零假设H0,所有方法的性能没有明显差异.
在 \alpha = 0.05 的情况下F分布临界值 \tau _F^{0.05}(8,72) = 2.0698 ,经计算可得在不同性能指标下的Friedman统计值 {\tau _F} ,如表5所示. 从表5可以看出,平均实时精度和最终累积精度下的 {\tau _F} 统计值均大于临界值 \tau _F^{0.05}(8,72) ,拒绝零假设 {H_0} ,说明所有方法性能存在显著差异.
表 5 平均实时精度和最终累积精度下的 {\tau _F}
Table 5. {\tau _F} of Average Real-Time Accuracy and Final Cumulative Accuracy
评价指标 {\tau _F} \tau _F^{0.05}(8,72)
平均实时精度 7.2260 2.0698
最终累积精度 4.5747 2.0698
本文用Bonferroni-Dunn测试[36]计算了所有方法的显著性差异,用于比较2种方法之间是否存在显著差异. 若2种方法的秩和平均差值大于临界差,则这2种方法的性能存在显著差异:
CD = {q_\alpha }\sqrt {\frac{K(K + 1)}{6N}} , (23)
其中当 K = 9 , N = 10 时,可以查表得到 {q_{\alpha = 0.05}} = 2.724 ,经计算得到显著性水平 \alpha = 0.05 下的临界差 CD = 3.3362 . 不同方法在平均实时精度和最终累积精度上的统计分析结果如图5所示,在图中将没有显著性差异的方法使用黑线连接起来. 结果表明,在统计意义上,TAEL方法排名最好且具有明显的优势.
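下面给出式(21)~(23)统计检验量计算的一个简化示意实现(非原文代码;ranks为N×K的秩矩阵,每行对应一个数据集、每列对应一种方法):
```python
import numpy as np

def friedman_statistic(ranks):
    """按式(21)(22)由秩矩阵计算Friedman统计值tau_F."""
    N, K = ranks.shape
    R = ranks.mean(axis=0)                                     # 式(21): 各方法的秩和平均
    tau_chi2 = 12 * N / (K * (K + 1)) * (np.sum(R ** 2) - K * (K + 1) ** 2 / 4)
    return (N - 1) * tau_chi2 / (N * (K - 1) - tau_chi2)       # 式(22)

def bonferroni_dunn_cd(K, N, q_alpha=2.724):
    """式(23): Bonferroni-Dunn检验的临界差CD."""
    return q_alpha * np.sqrt(K * (K + 1) / (6.0 * N))

# 当K=9, N=10时, bonferroni_dunn_cd(9, 10) = 2.724*sqrt(90/60) ≈ 3.3362, 与文中结果一致.
```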
3.4.2 模型鲁棒性分析
为了衡量各个方法的算法稳定性,本节计算每个方法在各个数据集上的鲁棒性,图6展示了计算结果. 图6中每个小矩形的面积代表的是算法在某种数据集上的鲁棒性值的大小,每一列上展示的数值代表算法在所有数据集上的鲁棒性值总和,即该算法的整体鲁棒性. 由图6可知,在大多数情况下,TAEL的鲁棒性都能取得较好的排名,且整体鲁棒性最高,这说明该方法对数据的噪声和异常值具有更强的鲁棒性,能提高集成模型的整体泛化性能.
3.4.3 模型收敛性分析
为比较各个方法在概念漂移发生后的收敛性能,本节计算并分析了各个方法在5个合成数据集的概念漂移位点上的收敛速度. 在收敛位点的判定过程中,设定收敛判定阈值 \gamma = 0.02 ,参照位点个数 K = 20 . 表6展示的为各个方法在数据集上的已知漂移位点上计算得到的收敛速度. 由于个别方法在漂移位点处精度保持平稳波动,因此,对该位点的收敛速度不做统计,用“-”进行表示. 从表6可以看出,TAEL在多数情况下都具有较快的收敛速度,是因为该方法及时更新基学习器使其能尽快适应新的数据分布,集成有效性得到提高. 在整体排名中TAEL处于第一,说明该方法具有较快的收敛速度,收敛性能较好.
表 6 不同方法在各数据集上的收敛速度
Table 6. Recovery Speed Under Accuracy of Different Methods on Each Dataset
数据集 DWCDS DNN-2 DNN-4 DNN-8 DNN-16
Sea 0.67/0.25/0.45 0.97/2.63/0.70 1.24/1.15/2.18 2.36/1.55/2.78 2.60/1.66/0.20
RBFBlips 0.56/0.85/0.16 1.17/1.56/0.41 0.38/0.61/0.22 0.56/0.94/0.21 -/-/-
LED_abrupt 3.70 9.77 14.92 19.85 17.94
LED_gradual 2.20/1.95/- 10.70/7.43/6.01 10.91/7.89/3.58 14.24/11.48/3.70 14.83/9.61/4.55
Tree 9.29/4.69/4.70 -/23.81/0.87 2.62/15.22/7.69 4.44/0.88/0.88 0.89/0.88/0.88
平均排名 3.54 5.85 4.85 5.54 5.77
数据集 HBP Highway Resnet TAEL
Sea 0.77/0.51/1.82 2.76/0.49/1.81 0.57/0.56/0.73 0.45/1.97/1.50
RBFBlips 1.14/0.83/0.21 0.85/1.14/0.42 1.01/2.02/0.62 0.03/0.29/0.01
LED_abrupt 20.20 9.80 11.65 5.28
LED_gradual 13.12/10.86/7.42 9.66/6.68/5.37 12.80/7.06/5.47 8.25/5.34/3.71
Tree 3.56/0.87/1.75 -/17.56/1.75 -/21.49/1.75 3.85/3.92/3.85
平均排名 5.38 5.23 5.69 3.15
注:“-”表示对当前位点的收敛速度不进行统计;黑体数字表示最高收敛速度;LED_abrupt包含1个漂移位点,收敛速度只有1个;其他数据集包含3个漂移位点,对应3个收敛速度.
4. 结束语
针对概念漂移发生后,在线集成模型无法及时响应数据流的变化而导致泛化性能降低、收敛速度减慢的问题,本文提出一种面向不同类型概念漂移的两阶段自适应集成学习方法. 本文通过检测漂移跨度来确定漂移类型,并采用一种针对漂移类型进行自适应调整的两阶段样本处理机制. 在该机制中,一方面通过样本过滤策略过滤历史样本块中的非关键样本,使历史数据分布更接近当前最新数据分布,提高了基学习器的有效性;另一方面通过样本扩充策略为当前样本集补充合适数量的历史关键样本,解决了当前基学习器的欠拟合问题,同时缓解了扩充后的类别不平衡. 更新后的基学习器组成的集成模型的有效性得到了提高,对不同类型的概念漂移能做出更精准及时的响应. 在集成学习中,集成的多样性同样影响了集成模型的性能,在未来的工作中,将进一步研究针对不同漂移类型提升集成多样性的方法.
作者贡献声明:郭虎升负责思想提出、方法设计、论文写作及修改;张洋负责论文写作、代码实现、数据测试及论文修改;王文剑负责写作指导、修改审定.
-
表 1 处理器微架构设计空间探索的加速方法分类
Table 1 Category of Acceleration Methods for Processor Microarchitecture Design Space Exploration
类型 子类型 典型方法
负载选择 基于微架构相关特征的方法 文献[15, 27]
负载选择 基于微架构无关特征的方法 MinneSPEC[28]、文献[29–32]、BenchSubset[33]、CASH[34]
负载选择 基于微架构相关与无关特征的方法 文献[29, 35–37]、BenchPrime[38]
部分模拟 统计采样模拟 采样单线程[39-43]、采样多线程[44-48]、采样访存[49-51]
部分模拟 综合模拟 综合单线程[52-55]、综合多线程[56-58]、综合访存[59-62]
设计点选择 采样方法 基于参数敏感度的方法[15,63-67]、基于实验设计的方法[6,25,34,67]
设计点选择 迭代搜索方法 启发式方法[68-70]、组合优化方法[68,71-74]、统计推理方法[14,25-26,67,75]
模拟工具 软件模拟 SimpleScalar[76]、SESC[77]、gem5[78]
模拟工具 硬件模拟 FAST[79]、PROTOFLEX[80-81]、RAMP Gold[82]、HAsim[83]、FireSim[84]
模拟工具 敏捷开发 基于低级语言的平台[85-87]、基于高级语言的平台[50,88-89]
性能模型 特定负载预测模型 参数化模型[2,4,90]、核函数模型[13,68,91]、神经网络模型[3,92-93]、树模型[94-96]、集成学习模型[67,97-98]
性能模型 跨负载预测模型 基于负载特征[8,99-100]、基于硬件响应[9,23,101]、基于迁移学习[7,10-11]
性能模型 机械模型 分析模型[102-103]、区间模型[104-106]、图模型[107-109]、概率统计模型[110-112]、混合模型[113-115]
表 2 加速方法对比
Table 2 Comparison of Acceleration Methods
表 3 负载选择方法的对比
Table 3 Comparison of Workload Selecting Methods
方法类型 方法来源 使用微架构相关特征的方式 使用微架构无关特征的方式 聚类算法 负载选择比 误差/% 基于微架构相关
特征的方法文献[15] 参数显著性排名 ✘ 阈值聚类 7/12 - 文献[27] 执行时间向量 ✘ 层次聚类 6/11 5 基于微架构无关
特征的方法文献[28] ✘ 卡方检验 ✘ -/23 - 文献[29] ✘ 主成分分析 层次聚类 7/79 - 文献[30] ✘ 基本块向量 距离最大 60/20 000 - 文献[31] ✘ 主成分分析 k均值聚类 9/21 15 文献[32] ✘ 基本块向量+主成分分析 层次聚类 4/47 - 文献[33] ✘ 分组主成分分析 共识聚类 - - 文献[34] ✘ 独立成分分析 多种聚类 5/27 3 文献[120] ✘ 主成分分析/遗传算法 k质心聚类 50/118 5 文献[121−122] ✘ 凸壳体积、主成分分析 遗传算法 6/22 - 基于微架构相关与
无关特征的方法文献[29,35] 主成分分析 层次聚类 14/29 - 文献[36] 主成分分析 层次聚类 10/23 - 文献[37] 主成分分析 层次聚类 12/43 7 文献[123] 多元因素分析 层次聚类 10/23 - 文献[38] 主成分分析+线性判别 多种聚类 20/54 - 注:“负载选择比”列中的“/”表示选择的负载数量和全部负载数量之比,“-”表示文献中无数据. “✘”表示无该项. 表 4 常用基准套件汇总
Table 4 Summary of Common Benchmark Suites
类型 工作负载 简称
多媒体和通信 MediaBench[124] MediaBench
嵌入式 MiBench[125] MiBench
单线程 SPEC CPU 2000[126] SPEC2k
单线程 SPEC CPU 2006[127] SPEC2k6
单/多线程 SPEC CPU 2017[18] SPEC2k17
多线程 Princeton Application Repository for Shared-Memory Computers[128] PARSEC
多线程 Stanford Parallel Applications for Shared Memory[129] SPLASH
表 5 微架构相关特征
Table 5 Microarchitecture-Dependent Features
类型 特征
整体聚合 执行时间、CPI、功率
控制流 分支预测MPKI、BTB命中率
cache行为(Icache/Dcache/L2/L3) 访问数量、命中数量、MPKI
TLB行为(ITLB/DTLB/L2TLB) 访问数量、命中数量、MPKI
注:MPKI表示每千条指令缺失.
表 6 微架构无关特征
Table 6 Microarchitecture-Independent Features
类型 子类型 特征 指令流 指令混合 整型、浮点、SIMD等 控制分支 存储读/写 寄存器通信 平均操作数数量 平均使用次数 重用距离 指令级并行性 不同窗口大小的并行度 基本块大小 指令局部性 指令工作集大小 时间、空间重用距离 数据流 数据局部性 数据工作集大小 时间、空间重用距离 通信特征 私有数据读写次数 生产者写/消费者读次数 注:SIMD表示单指令多数据流. 表 7 部分模拟加速方法的对比
Table 7 Comparison of Partial Simulation Acceleration Methods
类型 目标 子类型 方法来源 指令流 数据流 微架构相关特征 加速比 误差/% 统计采样模拟 采样单线程 随机采样 文献[134] ✘ ✘ ✘ - 7~17 均匀采样 文献[39] ✘ ✘ ✘ 35~60 0.6 文献[40] ✘ ✘ ✘ ~4 000 3.5 代表性采样 文献[41−42,135−136] ✔ ✘ ✘ 62~107 3.7 文献[43] ✔ ✘ ✘ 1~1.4 0.5 文献[137] ✔ ✘ IPC, cache ~100 3 文献[138] ✔ ✘ IPC, cache - 2~8 采样多线程 基于时间 文献[44] ✔ ✘ IPC 10 5 文献[139] ✔ ✘ IPC 5.8 3.5 文献[45] ✔ ✘ IPC 20 5.3 基于负载和特定同步 文献[47] ✔ ✘ ✘ 25 0.9 文献[48] ✔ ✘ ✘ 220 0.5 基于循环迭代 文献[46] ✔ ✘ ✘ 801 2.3 采样访存 基于检查点 文献[49] ✘ ✔ cache, BP 8 000~15 000 ~0.6 文献[51] ✘ ✔ cache, BP 50~100 ~0.6 文献[50,140−141] ✘ ✔ cache, BP - - 基于预热 文献[142] ✔ ✔ cache, BP 8 000~15 000 ~0.6 文献[143] ✔ ✔ cache, BP ~100 1.5 文献[144] ✔ ✔ cache, BP ~70 0.3 文献[145−147] ✔ ✔ cache, BP - - 综合模拟 综合单线程 文献[54] ✔ ✘ cache, BP - 5~7 文献[148] ✔ ✘ cache, BP - 4.1 文献[149] ✔ ✘ cache, BP - - 文献[52−53] ✔ ✘ cache, BP - 8 文献[150] ✔ ✘ cache, BP ~1 000 6.6 文献[55,151] ✔ ✔ cache, BP ~1 000 2.4 文献[116] ✔ ✔ cache, BP 520 5.1 文献[152] ✔ ✔ cache, BP - 3.2 文献[153−155] ✔ ✔ ✘ - - 综合多线程 文献[58] ✔ ✔ ✘ 9~385 3.8~9.8 文献[156] ✔ ✔ ✘ 1 000~10 000 4.9 文献[56−57] ✔ ✔ cache, BP 40~70 5.5 文献[157] ✔ ✔ cache, BP 21 8 综合访存 文献[59] ✘ ✔ ✘ - 0.4~3.1 文献[60−61] ✘ ✔ ✘ - - 文献[158−160] ✔ ✔ ✘ 31 4.8 文献[161] ✔ ✔ ✘ 20 2.8 文献[62] ✔ ✔ ✘ 20~50 4.2 文献[162] ✔ ✔ ✘ - 9 注:“-”表示文献无该数据. “✔”表示有使用该类数据,“✘”表示没有使用该类数据. 表 8 实验设计的对比
Table 8 Comparison of Design of Experiments
表 9 迭代搜索加速方法的对比
Table 9 Comparison of Iterative Searching Acceleration Methods
类型 子类型 方法来源 代理模型 搜索/获取函数 硬件设计空间 启发式 文献[174] - 参数聚类、贪心 单核片上系统 文献[64] - 敏感度、贪心 cache微架构 文献[66] - 敏感度、贪心 FPGA软核 文献[175] - 二进制搜索树 VLIW 文献[176] - 贪心、单目标化 CMP 组合优化 遗传算法 文献[71] - GA 单核片上系统 文献[72] 2层次模拟 局部搜索+GA 单核CPU 文献[73,177] 模糊系统 GA VLIW 文献[171] 多项式回归 GA 单核CPU 文献[117] - 爬山/GA/蚁群 CMP 文献[74] ANN预测级别 NSGA-II CMP 文献[69] ANN NSGA-II CMP 文献[178] ANN NSGA-II VLIW 文献[68] ACOSSO NSGA-II CMP 模拟退火 文献[178] ANN预测级别 模拟退火 VLIW 文献[179] 多种模型之一[25] 多种搜索算法 CMP 统计推理 不确定度 文献[67,97] AdaBoost.ANN CoV 单核CPU 文献[172−173] XGBoost 距离的最小值 单核CPU 预期改善 文献[75,180] 克里金模型预测级别 EI(+GA) CMP 文献[34] 随机深林 EI CMP 超体积改善 文献[13] ACOSSO EHVI CMP 文献[14] 高斯过程 EHVI 单核CPU 文献[181] AdaGBRT HVI+均匀性 单核CPU 文献[182] BagGBRT HVI+UCB 单核CPU 帕累托 文献[25] 多种模型之一 候选帕累托最优解集 CMP 文献[183−184] 马尔可夫决策 帕累托覆盖 CMP 文献[26,168] 马尔可夫网预测分布 帕累托最优解集 CMP 注:“-”表示该方法只以软件模拟或基于RTL的电路评估的方式获取性能指标,其余方法可通过训练代理模型替代软件模拟来获取指标或指标之间的关系. 表 10 模拟工具的对比
Table 10 Comparison of Simulation Tools
类型 准确率 模拟速度 灵活性 开发难度
软件模拟 低 中(~10 MHz) 高 低
硬件模拟 中 快(~100 MHz) 低 中
敏捷设计 高 慢(1~5 kHz) 中 高
表 11 硬件模拟平台的对比
Table 11 Comparison of Hardware Simulation Platforms
表 12 敏捷开发平台的对比
Table 12 Comparison of Agile Development Platforms
语言类型 平台 设计语言 指令集 年份
低级语言 OpenPiton[85] Verilog HDL SPARCv9 2016
低级语言 LiveHD[86] Verilog HDL RISC-V 2020
低级语言 BlackParrot[87] SystemVerilog RISC-V 2020
高级语言 CMD[88] BlueSpec RISC-V 2018
高级语言 Agile[197] Chisel RISC-V 2016
高级语言 Chipyard[89] Chisel RISC-V 2020
高级语言 MINJIE[50] Chisel RISC-V 2022
语言模型 llvm-mca[198] - - 2018
语言模型 Ithemal[199] - CISC 2019
语言模型 Chip-Chat[200] 自然语言 - 2023
语言模型 ChipGPT[201] 自然语言 RISC 2023
语言模型 RTLLM[202] 自然语言 RISC 2023
注:“-”表示文献无该项.
表 13 性能模型的对比
Table 13 Comparison of Performance Models
类型 准确性 复杂度 可解释性
预测模型 低 低 低
机械模型 高 高 高
表 14 性能预测模型的对比
Table 14 Comparison of Performance Prediction Models
类型 准确性 复杂度 可解释性
参数化 低 低 高
核函数 中 中 低
神经网络 中 高 低
树模型 中 中 高
集成学习 高 高 中
表 15 特定负载预测模型的对比
Table 15 Comparison of Workload-Specific Prediction Models
类型 预测模型 硬件设计空间 预测指标 负载 误差/% R2 采样/设计空间 参数化 线性回归[90] 单核 CPI MinnerSPEC 0.8 - 200/67×106 受限三次样条回归[2,4] 单核、异构核 CPI, E, P SPEC2k 4.9 - 4×103/22×109 三次样条回归模型[5] 单核、多核 T 18项负载 1.4 - 300/4.3×109 埃尔米特多项式插值[210] PHT, cache E SPEC2k, MediaBench - - 243/19×103 核函数 支持向量机[170] 单核 T, E SPEC2k 0.5 - 12/ 4608 内核典型相关分析[211] 多核 T, E ENePBench 6.2 0.88 450/2.8×106 ACOSSO[68] 单核、多核 T, E, P SPEC2k, SPLASH-2 - - 450/128×103 ACOSSO[13] 多核 T, E, P SPLASH-2 - - 100/332×103 高斯过程[91] 核数 T SPLASH-3, PARSEC-3 - 0.82 67/68 高斯过程[14] 单核 T, E, P 27项负载 - - 14/994 神经网络 径向基函数网络[92] 单核 CPI MinnerSPEC 2.8 - 200/512 小波神经网络[93] 单核 CPI, E, P SPEC2k - - 1024 /246×103神经网络[3,209,212-213] 单核、多核 CPI MinneSPEC等 2.3 - 221/23×103 神经网络+遗传算法[214] 单核 CPI SPEC2k 3.3 230/23×103 树模型 模型树[94] 性能计数器 CPI SPEC2k6 7.8 0.98 - 模型树[95] 单核 T, E 图像压缩负载 1.3 0.95 3211 /3288 决策树[138] 性能计数器 CPI SPEC2k6,SysMark07等 2 - - 决策树[96] 异构核 T, E SD-VBS, MiBench 2.1 - 664/830 集成学习 自适应提升+神经网络[67,97] 单核 CPI SPEC2k6 - - 264/8.4×106 梯度提升回归树[169] 单核、多核 T SPEC2k, SPLASH-2 1.1 - 3×103/15×106 XGBoost[172] 单核 E riscv-tests 3.4 0.99 1120 /1200 提升法+梯度提升回归树[181] 单核 CPI, E, P SPEC2k17 - - 100/2×103 装袋法+模型树[98] 单核 CPI, E SPEC2k - - 320/71×106 装袋法+梯度提升回归树[182] 单核 CPI, E, P SPEC2k17 - - 100/37×103 堆叠法+决策树[22] 单核、多核 T, E SPEC2k6,SPLASH-2 - - 100/605×103 堆叠法+异类模型[118] 单核 CPI, E SPEC2k 1.8 - 3×103/2.5×109 注:硬件设计空间中单核主要包括单核处理器微架构,多核指基于总线或片上网络的同构多核处理器. “T”指时间,“E”指功率,“P”指对多个性能指标探索帕累托最优解集,误差以CPI的百分比绝对误差衡量(越接近0越好),R2为相关系数(越接近1越好),“-”表示该工作无显式标注数据. 表 16 跨负载预测模型工作的对比
Table 16 Comparison of Cross-Workload Prediction Model Work
类型 来源 预测模型 跨负载方法核心 性能指标 设计点数 误差/% R2 负载特征 文献[99] 归一化、PCA+GA、线性回归 负载特征、平均相似负载的结果 时间 35×25+0 - - 文献[8] 多项式回归+遗传算法 负载特征 CPI 360×7+0 8~10 >0.90 文献[100] 模型树 负载特征 CPI、功率 500×25+0 - 0.90 文献[34] 多种模型之一 负载特征、最近邻归类模型 CPI、功率 3 000×10+0 - 0.98 硬件响应 文献[9] 神经网络 模型本身泛化 时间 639×27+50 - - 文献[23] 矩阵补全算法 模型本身泛化 CPI、功率 128×20+20 10.0 - 文献[101] 线性回归 响应边际关系、最近邻归类 CPI 60×23+600 6.3 0.92 文献[215] 神经网络 响应签名、模型本身泛化 时间、EDP 1000 ×8+04.2 - 迁移学习 文献[11] 神经网络 线性回归 CPI、功率 512×5+32 7.0 0.95 文献[10,216] 神经网络 贪心选择负载、线性回归 CPI、功率 512×5+32 3.0 - 文献[7] 模型树+自适应提升 负载聚类、样本迁移TrAdaBoost CPI 10×5+10 7.0 0.91 文献[6] 神经网络+自适应提升 支持向量机 CPI 128×3+40 5.5 0.93 注:“-”表示文献无该数据. “设计点数”列中的表达形式为源样本数量×源负载数量+目标样本数量. 表 17 跨负载预测模型的对比
Table 17 Comparison of Cross-Workload Prediction Models
类型 核心 准确性 复杂度
负载特征 特征空间的相似性 低 低
硬件响应 硬件响应作为新维度 中 中
迁移学习 元模型的知识迁移 高 高
表 18 机械模型工作的对比
Table 18 Comparison of Mechanism Model Work
类型 来源 目标架构 组件 预测指标 仅微架构无关特征 误差/% 速率/(MIPS/核) 分析模型 文献[219] cache cache cache缺失 ✘ - - 文献[102] 乱序、单核 指令窗口、BP、cache IPC ✘ 5.5 100 文献[103] cache、多核 cache IPC ✘ 1.57 - 文献[220] cache cache 功率、面积 ✘ 5 - 文献[221] 按/乱序、多核 BP、cache、NoC等 功率、面积 ✘ 11~23 - 区间模型 文献[222] 乱序、单核 BP, cache IPC ✘ 5.8 - 文献[104] 乱序、单核 BP, cache IPC ✘ 7 - 文献[223] 乱序、多核 BP, cache IPC ✘ 4.6 ~1 文献[105,224] 按序、单核 指令依赖、BP, cache CPI、功率 ✘ 2.5 ~6 文献[225] 乱序、单核 BP, cache IPC、功率 ✔ 9.3 1.9 文献[226] 乱序、多核 BP, cache CPI ✔ 11.2 - 文献[227] 乱序、单核 SIMD、cache、带宽 CPI、功率 ✔ 25 - 文献[228] 乱序、多核 SIMD、cache、带宽 时间、功率 ✔ 36 - 图模型 文献[107] 乱序、单核 BP, cache CPI ✘ - - 文献[108] 乱序、单核 BP, cache CPI ✘ - - 文献[109] 乱序、多核 BP, cache, NoC IPC ✘ 7.2 ~12 概率统计模型 文献[110] 乱序、单核 BP, cache IPC ✘ 2~10 - 文献[111] 乱序、多核 BP, cache IPC ✘ 7.9 ~9 文献[112] cache cache cache缺失 ✘ 0.2 - 混合模型 文献[113] 乱序、单核 流水线深度 IPC ✘ - - 文献[114] 乱序、单核 cache、MSHR、预取 CPI、cache缺失 ✘ 9.4 ~15 文献[115] 乱序、单核 执行单元 CPI ✘ 5.6 15.1 注:“-”表示文献无该数据. 表 19 机械模型的对比
Table 19 Comparison of Mechanism Models
类型 核心思想 准确性 复杂度
分析模型 数学公式 低 低
区间模型 事件分隔的区间 中 高
图模型 依赖图的关键路径 中 中
概率统计模型 事件发生的概率 中 中
混合模型 分析模型+预测模型 中 低
-
[1] Azizi O, Mahesri A, Lee B C, et al. Energy-performance tradeoffs in processor architecture and circuit design: A marginal cost analysis[C]//Proc of the 27th Annual Int Symp on Computer Architecture. New York: ACM, 2010: 26–36
[2] Lee B C, Brooks D M. Illustrative design space studies with microarchitectural regression models[C]//Proc of the 13th Int Conf on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2007: 340–351
[3] Ipek E, McKee S A, Caruana R, et al. Efficiently exploring architectural design spaces via predictive modeling[C]//Proc of the 12th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2006: 195–206
[4] Lee B C, Brooks D M. Accurate and efficient regression modeling for microarchitectural performance and power prediction[C]//Proc of the 12th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2006: 185–194
[5] Lee B C, Collins J D, Wang Hong, et al. CPR: Composable performance regression for scalable multiprocessor models[C]//Proc of the 41st Annual IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2008: 270–281
[6] Li Dandan, Yao Shuzhen, Wang Senzhang, et al. Cross-program design space exploration by ensemble transfer learning[C]//Proc of the 36th IEEE/ACM Int Conf on Computer-Aided Design. Piscataway, NJ: IEEE, 2017: 201–208
[7] Li Dandan, Wang Senzhang, Yao Shuzhen, et al. Efficient design space exploration by knowledge transfer[C]//Proc of the 11th IEEE/ACM/IFIP Int Conf on Hardware/Software Codesign and System Synthesis. New York: ACM, 2016: 12: 1−12: 10
[8] Wu Weidan, Lee B C. Inferred models for dynamic and sparse hardware-software spaces[C]//Proc of the 45th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2012: 413–424
[9] Wang Yu, Lee V, Wei G Y, et al. Predicting new workload or CPU performance by analyzing public datasets[J]. ACM Transactions on Architecture and Code Optimization, 2019, 15(4): 53: 1−53: 21
[10] Dubach C, Jones T M, O’Boyle M F P. An empirical architecture-centric approach to microarchitectural design space exploration[J]. IEEE Transactions on Computers, 2011, 60(10): 1445−1458 doi: 10.1109/TC.2010.280
[11] Dubach C, Jones T M, O’Boyle M F P. Microarchitectural design space exploration using an architecture-centric approach[C]//Proc of the 40th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2007: 262–271
[12] Eeckhout L, De Bosschere K. Speeding up architectural simulations for high-performance processors[J]. Simulation, 2004, 80(9): 451−468 doi: 10.1177/0037549704044326
[13] Wang Hongwei, Shi Jinglin, Zhu Ziyuan. An expected hypervolume improvement algorithm for architectural exploration of embedded processors[C]//Proc of the 53rd Annual Design Automation Conf. New York: ACM, 2016: 161: 1−161: 6
[14] Bai Chen, Sun Qi, Zhai Jianwang, et al. BOOM-Explorer: RISC-V BOOM microarchitecture design space exploration framework[C/OL]//Proc of the 40th IEEE/ACM Int Conf on Computer Aided Design. Piscataway, NJ: IEEE, 2021[2023-12-17]. https://ieeexplore.ieee.org/document/9643455
[15] Yi J J, Lilja D J, Hawkins D M. A statistically rigorous approach for improving simulation methodology[C]//Proc of the 9th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2003: 281–291
[16] Monchiero M, Canal R, González A. Power/performance/thermal design-space exploration for multicore architectures[J]. IEEE Transactions on Parallel and Distributed Systems, 2008, 19(5): 666−681 doi: 10.1109/TPDS.2007.70756
[17] 包云岗,常轶松,韩银和,等. 处理器芯片敏捷设计方法:问题与挑战[J]. 计算机研究与发展,2021,58(6):1131−1145 doi: 10.7544/issn1000-1239.2021.20210232 Bao Yungang, Chang Yisong, Han Yinhe, et al. Agile design of processor chips: Issues and challenges[J]. Journal of Computer Research and Development, 2021, 58(6): 1131−1145 (in Chinese) doi: 10.7544/issn1000-1239.2021.20210232
[18] Standard Performance Evaluation Corporation. SPEC CPU2017[EB/OL]. (2012-12-06)[2023-12-01]. https://www.spec.org/cpu2017
[19] Yi J J, Lilja D J. Simulation of computer architectures: Simulators, benchmarks, methodologies, and recommendations[J]. IEEE Transactions on Computers, 2006, 55(3): 268−280 doi: 10.1109/TC.2006.44
[20] Guo Qi, Chen Tianshi, Chen Yunji, et al. Accelerating architectural simulation via statistical techniques: A survey[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2016, 35(3): 433−446 doi: 10.1109/TCAD.2015.2481796
[21] O’Neal K, Brisk P. Predictive modeling for CPU, GPU, and FPGA performance and power consumption: A survey[C]//Proc of the 2018 IEEE Computer Society Annual Symp on VLSI. Los Alamitos, CA: IEEE Computer Society, 2018: 763–768
[22] Chen Tianshi, Guo Qi, Tang Ke, et al. ArchRanker: A ranking approach to design space exploration[C]//Proc of the 41st Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2014: 85–96
[23] Ding Yi, Mishra N, Hoffmann H. Generative and multi-phase learning for computer systems optimization[C]//Proc of the 46th Int Symp on Computer Architecture. New York: ACM, 2019: 39–52
[24] Panerati J, Beltrame G. A comparative evaluation of multi-objective exploration algorithms for high-level design[J]. ACM Transactions on Design Automation of Electronic Systems, 2014, 19(2): 15: 1–15: 22
[25] Palermo G, Silvano C, Zaccaria V. ReSPIR: A response surface-based pareto iterative refinement for application-specific design space exploration[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2009, 28(12): 1816−1829 doi: 10.1109/TCAD.2009.2028681
[26] Mariani G, Palermo G, Zaccaria V, et al. DeSpErate++: An enhanced design space exploration framework using predictive simulation scheduling[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2015, 34(2): 293−306 doi: 10.1109/TCAD.2014.2379634
[27] Cammarota R, Beni L A, Nicolau A, et al. Effective evaluation of multi-core based systems[C]//Proc of the 12th Int Symp on Parallel and Distributed Computing. Piscataway, NJ: IEEE, 2013: 19–25
[28] KleinOsowski A J, Lilja D J. MinneSPEC: A new spec benchmark workload for simulation-based computer architecture research[J]. IEEE Computer Architecture Letters, 2002, 1(1): 7−10 doi: 10.1109/L-CA.2002.8
[29] Eeckhout L, Vandierendonck H, De Bosschere K. Workload design: Selecting representative program-input pairs[C]//Proc of the 11th Int Conf on Parallel Architectures and Compilation Techniques. Los Alamitos, CA: IEEE Computer Society, 2002: 83–94
[30] Breughe M, Eeckhout L. Selecting representative benchmark inputs for exploring microprocessor design spaces[J]. ACM Transactions on Architecture and Code Optimization, 2013, 10(4): 37: 1−37: 24
[31] Joshi A, Phansalkar A, Eeckhout L, et al. Measuring benchmark similarity using inherent program characteristics[J]. IEEE Transactions on Computers, 2006, 55(6): 769−782 doi: 10.1109/TC.2006.85
[32] Vandeputte F, Eeckhout L. Phase complexity surfaces: Characterizing time-varying program behavior[C]//Proc of the 3rd High Performance Embedded Architectures and Compilers. Berlin: Springer, 2008: 320–334
[33] Zhan Hongping, Lin Weiwei, Mao Feiqiao, et al. BenchSubset: A framework for selecting benchmark subsets based on consensus clustering[J]. International Journal of Intelligent Systems, 2022, 37(8): 5248−5271 doi: 10.1002/int.22791
[34] Sheidaeian H, Fatemi O. Toward a general framework for jointly processor-workload empirical modeling[J]. The Journal of Supercomputing, 2021, 77(6): 5319−5353 doi: 10.1007/s11227-020-03475-9
[35] Phansalkar A, Joshi A, John L K. Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite[C]//Proc of the 34th Int Symp on Computer Architecture. New York: ACM, 2007: 412–423
[36] Limaye A, Adegbija T. A workload characterization of the SPEC CPU2017 benchmark suite[C]//Proc of the 2018 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2018: 149–158
[37] Panda R, Song Shuang, Dean J, et al. Wait of a decade: Did SPEC CPU 2017 broaden the performance horizon[C]//Proc of the 23rd IEEE Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2018: 271–282
[38] Liu Qingrui, Wu Xiaolong, Kittinger L, et al. BenchPrime: Effective building of a hybrid benchmark suite[J]. ACM Transactions in Embedded Computing Systems, 2017, 16(5): 179: 1−179: 22
[39] Wunderlich R E, Wenisch T F, Falsafi B, et al. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling[C]//Proc of the 30th Annual Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2003: 84–95
[40] Hassani S, Southern G, Renau J. LiveSim: Going live with microarchitecture simulation[C]//Proc of the 22nd IEEE Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2016: 606–617
[41] Hamerly G, Perelman E, Lau J, et al. SimPoint 3.0: Faster and more flexible program phase analysis[J/OL]. Journal of Instruction-Level Parallelism, 2005[2023-12-18]. http://www.jilp.org/vol7/v7paper14.pdf
[42] Sherwood T, Perelman E, Hamerly G, et al. Discovering and exploiting program phases[J]. IEEE Micro, 2003, 23(6): 84−93 doi: 10.1109/MM.2003.1261391
[43] Shen Xipeng, Zhong Yutao, Ding Chen. Locality phase prediction[C]//Proc of the 11th Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2004: 165–176
[44] Ardestani E K, Renau J. ESESC: A fast multicore simulator using time-based sampling[C]//Proc of the 19th Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2013: 448–459
[45] Jiang Chuntao, Yu Zhibin, Jin Hai, et al. PCantorSim: Accelerating parallel architecture simulation through fractal-based sampling[J]. ACM Transactions on Architecture and Code Optimization, 2013, 10(4): 49: 1–49: 24
[46] Sabu A, Patil H, Heirman W, et al. LoopPoint: Checkpoint-driven sampled simulation for multi-threaded applications[C]//Proc of the 28th Int Symp on High-Performance Computer Architecture. Piscataway, NJ: IEEE, 2022: 604–618
[47] Carlson T E, Heirman W, Van Craeynest K, et al. BarrierPoint: Sampled simulation of multi-threaded applications[C]//Proc of the 2014 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2014: 2–12
[48] Grass T, Carlson T E, Rico A, et al. Sampled simulation of task-based programs[J]. IEEE Transactions on Computers, 2019, 68(2): 255−269 doi: 10.1109/TC.2018.2860012
[49] Wenisch T F, Wunderlich R E, Ferdman M, et al. SimFlex: Statistical sampling of computer system simulation[J]. IEEE Micro, 2006, 26(4): 18−31 doi: 10.1109/MM.2006.79
[50] Xu Yinan, Yu Zihao, Tang Dan, et al. Towards developing high performance RISC-V processors using agile methodology[C]//Proc of the 55th IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2022: 1178–1199
[51] Bryan P D, Rosier M C, Conte T M. Reverse state reconstruction for sampled microarchitectural simulation[C]//Proc of the 2007 IEEE Int Symp on Performance Analysis of Systems & Software. Los Alamitos, CA: IEEE Computer Society, 2007: 190–199
[52] Nussbaum S, Smith J E. Modeling superscalar processors via statistical simulation[C]//Proc of the 10th Int Conf on Parallel Architectures and Compilation Techniques. Los Alamitos, CA: IEEE Computer Society, 2001: 15–24
[53] Eeckhout L, Nussbaum S, Smith J E, et al. Statistical simulation: Adding efficiency to the computer designer’s toolbox[J]. IEEE Micro, 2003, 23(5): 26−38 doi: 10.1109/MM.2003.1240210
[54] Oskin M, Chong F T, Farrens M. HLS: Combining statistical and symbolic simulation to guide microprocessor designs[C]//Proc of the 27th Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2000: 71–82
[55] Bell R H, John L K. Improved automatic testcase synthesis for performance model validation[C]//Proc of the 19th Annual Int Conf on Supercomputing. New York: ACM, 2000: 111–120
[56] Genbrugge D, Eeckhout L. Statistical simulation of chip multiprocessors running multi-program workloads[C]//Proc of the 25th Int Conf on Computer Design. Piscataway, NJ: IEEE, 2007: 464–471
[57] Genbrugge D, Eeckhout L. Chip multiprocessor design space exploration through statistical simulation[J]. IEEE Transactions on Computers, 2009, 58(12): 1668−1681 doi: 10.1109/TC.2009.77
[58] Hughes C, Li T. Accelerating multi-core processor design space evaluation using automatic multi-threaded workload synthesis[C]//Proc of the 4th Int Symp on Workload Characterization. Los Alamitos, CA: IEEE Computer Society, 2008: 163–172
[59] Balakrishnan G, Solihin Y. WEST: Cloning data cache behavior using stochastic traces[C]//Proc of the 18th IEEE Int Symp on High-Performance Comp Architecture. Los Alamitos, CA: IEEE Computer Society, 2012: 1–12
[60] Awad A, Solihin Y. STM: Cloning the spatial and temporal memory access behavior[C]//Proc of the 20th IEEE Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2014: 237–247
[61] Wang Yipeng, Awad A, Solihin Y. Clone morphing: Creating new workload behavior from existing applications[C]//Proc of the 2017 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2017: 97–108
[62] Wang Yipeng, Balakrishnan G, Solihin Y. MeToo: Stochastic modeling of memory traffic timing behavior[C]//Proc of the 24th Int Conf on Parallel Architecture and Compilation. Los Alamitos, CA: IEEE Computer Society, 2015: 457–467
[63] Hekstra G J, La Hei G D, Bingley P, et al. TriMedia CPU64 design space exploration[C]//Proc of the 17th IEEE Int Conf on Computer Design: VLSI in Computers and Processors. Los Alamitos, CA: IEEE Computer Society, 1999: 599–606
[64] Fornaciari W, Sciuto D, Silvano C, et al. A design framework to efficiently explore energy-delay tradeoffs[C]//Proc of the 9th Int Symp on Hardware/Software Codesign. New York: ACM, 2001: 260–265
[65] Fornaciari W, Sciuto D, Silvano C, et al. A sensitivity-based design space exploration methodology for embedded systems[J]. Design Automation for Embedded Systems, 2002, 7(1): 7−33
[66] Sheldon D, Kumar R, Lysecky R, et al. Application-specific customization of parameterized FPGA soft-core processors[C]//Proc of the 25th IEEE/ACM Int Conf on Computer-Aided Design. New York: ACM, 2006: 261–268
[67] Li Dandan, Yao Shuzhen, Liu Yuhang, et al. Efficient design space exploration via statistical sampling and AdaBoost learning[C]//Proc of the 53rd Annual Design Automation Conf. New York: ACM, 2016: 142: 1−142: 6
[68] Wang Hongwei, Zhu Ziyuan, Shi Jinglin, et al. An accurate acosso metamodeling technique for processor architecture design space exploration[C]//Proc of the 20th Asia and South Pacific Design Automation Conf. Piscataway, NJ: IEEE, 2015: 689–694
[69] Mariani G, Palermo G, Zaccaria V, et al. Design-space exploration and runtime resource management for multicores[J]. ACM Transactions on Embedded Computing Systems, 2013, 13(2): 20: 1−20: 27
[70] Jahr R, Calborean H, Vintan L, et al. Boosting design space explorations with existing or automatically learned knowledge[C]//Proc of the 15th Measurement, Modelling, and Evaluation of Computing Systems and Dependability and Fault Tolerance. Berlin: Springer, 2012: 221–235
[71] Palesi M, Givargis T. Multi-objective design space exploration using genetic algorithms[C]//Proc of the 10th Int Symp on Hardware/Software Codesign. New York: ACM, 2002: 67–72
[72] Eyerman S, Eeckhout L, De Bosschere K. Efficient design space exploration of high performance embedded out-of-order processors[C]//Proc of the 9th Design, Automation & Test in Europe Conf and Exhibition. Piscataway, NJ: IEEE, 2006: 351−356
[73] Ascia G, Catania V, Di Nuovo A G, et al. Efficient design space exploration for application specific systems-on-a-chip[J]. Journal of Systems Architecture, 2007, 53(10): 733−750 doi: 10.1016/j.sysarc.2007.01.004
[74] Mariani G, Palermo G, Silvano C, et al. Multi-processor system-on-chip design space exploration based on multi-level modeling techniques[C]//Proc of the 9th Int Conf on Embedded Computer Systems: Architectures, Modeling and Simulation. Piscataway, NJ: IEEE, 2009: 118–124
[75] Mariani G, Palermo G, Zaccaria V, et al. OSCAR: An optimization methodology exploiting spatial correlation in multicore design spaces[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2012, 31(5): 740−753 doi: 10.1109/TCAD.2011.2177457
[76] Burger D, Austin T M. The SimpleScalar tool set, version 2.0[J]. ACM SIGARCH Computer Architecture News, 1997, 25(3): 13−25 doi: 10.1145/268806.268810
[77] Renau J, Fraguela B, Tuck J, et al. SESC simulator[EB/OL]. 2005[2023-12-01]. http://sesc.sourceforge.net
[78] Binkert N, Beckmann B, Black G, et al. The gem5 simulator[J]. ACM SIGARCH Computer Architecture News, 2011, 39(2): 1−7 doi: 10.1145/2024716.2024718
[79] Chiou D, Sunwoo D, Kim J, et al. FPGA-accelerated simulation technologies (FAST): Fast, full-system, cycle-accurate simulators[C]//Proc of the 40th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2007: 249–261
[80] Chung E S, Nurvitadhi E, Hoe J C, et al. A complexity-effective architecture for accelerating full-system multiprocessor simulations using FPGAs[C]//Proc of the 16th Int ACM/SIGDA Symp on Field Programmable Gate Arrays. New York: ACM, 2008: 77–86
[81] Chung E S, Papamichael M K, Nurvitadhi E, et al. ProtoFlex: Towards scalable, full-system multiprocessor simulations using FPGAs[J]. ACM Transactions on Reconfigurable Technology and Systems, 2009, 2(2): 15: 1–15: 32
[82] Tan Zhangxi, Waterman A, Avizienis R, et al. RAMP Gold: An FPGA-based architecture simulator for multiprocessors[C]//Proc of the 47th Design Automation Conf. New York: ACM, 2010: 463–468
[83] Pellauer M, Adler M, Kinsy M, et al. HAsim: FPGA-based high-detail multicore simulation using time-division multiplexing[C]//Proc of the 17th Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2011: 406–417
[84] Karandikar S, Mao H, Kim D, et al. FireSim: FPGA-accelerated cycle-exact scale-out system simulation in the public cloud[C]//Proc of the 45th Annual Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2018: 29–42
[85] Balkind J, McKeown M, Fu Yaosheng, et al. OpenPiton: An open source manycore research framework[C]//Proc of the 21st Int Conf on Architectural Support for Programming Languages and Operating Systems. New York: ACM, 2016: 217–232
[86] Wang Shenghong, Possignolo R T, Skinner H B, et al. LiveHD: A productive live hardware development flow[J]. IEEE Micro, 2020, 40(4): 67−75 doi: 10.1109/MM.2020.2996508
[87] Petrisko D, Gilani F, Wyse M, et al. BlackParrot: An agile open-source RISC-V multicore for accelerator socs[J]. IEEE Micro, 2020, 40(4): 93−102 doi: 10.1109/MM.2020.2996145
[88] Zhang Sizhuo, Wright A, Bourgeat T, et al. Composable building blocks to open up processor design[C]//Proc of the 51st Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2018: 68–81
[89] Amid A, Biancolin D, Gonzalez A, et al. Chipyard: Integrated design, simulation, and implementation framework for custom socs[J]. IEEE Micro, 2020, 40(4): 10−21 doi: 10.1109/MM.2020.2996616
[90] Joseph P J, Vaswani K, Thazhuthaveetil M J. Construction and use of linear regression models for processor performance analysis[C]//Proc of the 12th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2006: 99–108
[91] Agarwal N, Jain T, Zahran M. Performance prediction for multi-threaded applications[C]//Proc of the 2nd Int Workshop on AI-assisted Design for Architecture. New York: ACM, 2019: 71−76
[92] Joseph P J, Vaswani K, Thazhuthaveetil M J. A predictive performance model for superscalar processors[C]//Proc of the 39th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2006: 161–170
[93] Cho C B, Zhang Wangyuan, Li Tao. Informed microarchitecture design space exploration using workload dynamics[C]//Proc of the 40th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2007: 274–285
[94] Ould-Ahmed-Vall E, Woodlee J, Yount C, et al. Using model trees for computer architecture performance analysis of software applications[C]//Proc of the 2007 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2007: 116–125
[95] Powell A, Savvas-Bouganis C, Cheung P Y K. High-level power and performance estimation of FPGA-based soft processors and its application to design space exploration[J]. Journal of Systems Architecture, 2013, 59(10): 1144−1156 doi: 10.1016/j.sysarc.2013.08.003
[96] Mankodi A, Bhatt A, Chaudhury B. Predicting physical computer systems performance and power from simulation systems using machine learning model[J]. Computing, 2022, 105(5): 1−19
[97] Li Dandan, Yao Shuzhen, Wang Ying. Processor design space exploration via statistical sampling and semi-supervised ensemble learning[J]. IEEE Access, 2018, 6: 25495−25505 doi: 10.1109/ACCESS.2018.2831079
[98] Guo Qi, Chen Tianshi, Chen Yunji, et al. Effective and efficient microprocessor design space exploration using unlabeled design configurations[C]//Proc of the 22nd Int Joint Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2011: 1671–1677
[99] Hoste K, Phansalkar A, Eeckhout L, et al. Performance prediction based on inherent program similarity[C]//Proc of the 15th Int Conf on Parallel Architectures and Compilation Techniques. New York: ACM, 2006: 114–122
[100] Guo Qi, Chen Tianshi, Chen Yunji, et al. Microarchitectural design space exploration made fast[J]. Microprocessors and Microsystems, 2013, 37(1): 41−51 doi: 10.1016/j.micpro.2012.07.006
[101] Ahmadinejad H, Fatemi O. Moving towards grey-box predictive models at micro-architecture level by investigating inherent program characteristics[J]. IET Computers Digital Techniques, 2018, 12(2): 53−61 doi: 10.1049/iet-cdt.2016.0148
[102] Taha T M, Wills S. An instruction throughput model of superscalar processors[J]. IEEE Transactions on Computers, IEEE, 2008, 57(3): 389−403 doi: 10.1109/TC.2007.70817
[103] Xu Chi, Chen Xi, Dick R P, et al. Cache contention and application performance prediction for multi-core systems[C]//Proc of the 2010 IEEE Int Symp on Performance Analysis of Systems & Software. Los Alamitos, CA: IEEE Computer Society, 2010: 76–86
[104] Eyerman S, Eeckhout L, Karkhanis T, et al. A mechanistic performance model for superscalar out-of-order processors[J]. ACM Transactions on Computer Systems, 2009, 27(2): 3: 1–3: 37
[105] Breughe M B, Eyerman S, Eeckhout L. Mechanistic analytical modeling of superscalar in-order processor performance[J]. ACM Transactions on Architecture and Code Optimization, 2015, 11(4): 50: 1–50: 26
[106] Carlson T E, Heirman W, Eeckhout L. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation[C]//Proc of the 2011 Conf on High Performance Computing Networking, Storage and Analysis. New York: ACM, 2011: 52: 1−52: 12
[107] Wang Lei, Tang Yuxing, Deng Yu, et al. A scalable and fast microprocessor design space exploration methodology[C]//Proc of the 9th Int Symp on Embedded Multicore/Many-core Systems-on-Chip. Los Alamitos, CA: IEEE Computer Society, 2015: 33–40
[108] Lee J, Jang H, Kim J. RpStacks: Fast and accurate processor design space exploration using representative stall-event stacks[C]//Proc of the 47th Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2014: 255–267
[109] Jang H, Jo J E, Lee J, et al. RpStacks-MT: A high-throughput design evaluation methodology for multi-core processors[C]//Proc of the 51st Annual IEEE/ACM Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2018: 586–599
[110] Noonburg D B, Shen J P. A framework for statistical modeling of superscalar processor performance[C]//Proc of the 3rd Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 1997: 298–309
[111] Chen X E, Aamodt T M. A first-order fine-grained multithreaded throughput model[C]//Proc of the 15th Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2009: 329–340
[112] Liang Y, Mitra T. An analytical approach for fast and accurate design space exploration of instruction caches[J]. ACM Transactions on Embedded Computing Systems, 2013, 13(3): 43: 1−43: 29
[113] Hartstein A, Puzak T R. The optimum pipeline depth for a microprocessor[C]//Proc of the 29th Annual Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2002: 7–13
[114] Chen X E, Aamodt T M. Hybrid analytical modeling of pending cache hits, data prefetching, and MSHRs[J]. ACM Transactions on Architecture and Code Optimization, 2011, 8(3): 59−70
[115] Li L, Pandey S, Flynn T, et al. SimNet: Accurate and high-performance computer architecture simulation using deep learning[C]//Proc of the 2022 ACM SIGMETRICS/IFIP Performance Joint Int Conf on Measurement and Modeling of Computer Systems. New York: ACM, 2022: 67–68
[116] Panda R, John L K. Proxy benchmarks for emerging big-data workloads[C]//Proc of the 26th Int Conf on Parallel Architectures and Compilation Techniques. Los Alamitos, CA: IEEE Computer Society, 2017: 105–116
[117] Kang S, Kumar R. Magellan: A search and machine learning-based framework for fast multi-core design space exploration and optimization[C]//Proc of the 2008 Design, Automation and Test in Europe. New York: ACM, 2008: 1432–1437
[118] Guo Qi, Chen Tianshi, Zhou Zhihua, et al. Robust design space modeling[J]. ACM Transactions on Design Automation of Electronic Systems, 2015, 20(2): 18: 1–18: 22
[119] 张乾龙,侯锐,杨思博,等. 体系结构模拟器在处理器设计过程中的作用[J]. 计算机研究与发展,2019,56(12):2702−2719 doi: 10.7544/issn1000-1239.2019.20190044
Zhang Qianlong, Hou Rui, Yang Sibo, et al. The role of architecture simulators in the process of CPU design[J]. Journal of Computer Research and Development, 2019, 56(12): 2702−2719 (in Chinese) doi: 10.7544/issn1000-1239.2019.20190044
[120] Hoste K, Eeckhout L. Microarchitecture-independent workload characterization[J]. IEEE Micro, 2007, 27(3): 63−72 doi: 10.1109/MM.2007.56
[121] Jin Zhanpeng, Cheng A C. Evolutionary benchmark subsetting[J]. IEEE Micro, 2008, 28(6): 20−36 doi: 10.1109/MM.2008.87
[122] Jin Zhanpeng, Cheng A C. SubsetTrio: An evolutionary, geometric, and statistical benchmark subsetting framework[J]. ACM Transactions on Modeling and Computer Simulation, 2011, 21(3): 21: 1–21: 23
[123] Jin Zhanpeng, Cheng A C. Improve simulation efficiency using statistical benchmark subsetting: An implantbench case study[C]//Proc of the 45th Annual Design Automation Conf. New York: ACM, 2008: 970–973
[124] Lee C, Potkonjak M, Mangione-Smith W H. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems[C]//Proc of the 30th Annual Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 1997: 330–335
[125] Guthaus M R, Ringenberg J S, Ernst D, et al. MiBench: A free, commercially representative embedded benchmark suite[C]//Proc of the 4th Annual IEEE Int Workshop on Workload Characterization. Piscataway, NJ: IEEE, 2001: 3–14
[126] Standard Performance Evaluation Corporation. SPEC CPU2000[EB/OL]. (2007-06-07)[2023-12-01]. https://www.spec.org/cpu2000
[127] Standard Performance Evaluation Corporation. SPEC CPU2006[EB/OL]. (2023-01-06)[2023-12-01]. https://www.spec.org/cpu2006
[128] Bienia C, Kumar S, Singh J P, et al. The PARSEC benchmark suite: Characterization and architectural implications[C]//Proc of the 17th Int Conf on Parallel Architectures and Compilation Techniques. New York: ACM, 2008: 72–81
[129] Woo S C, Ohara M, Torrie E, et al. The SPLASH-2 programs: Characterization and methodological considerations[C]//Proc of the 22nd Annual Int Symp on Computer Architecture. New York: ACM, 1995: 24–36
[130] Chandra D, Guo Fei, Kim S, et al. Predicting inter-thread cache contention on a chip multi-processor architecture[C]//Proc of the 11th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2005: 340–351
[131] Hsu W C, Chen H, Yew P C, et al. On the predictability of program behavior using different input data sets[C]//Proc of the 6th Annual Workshop on Interaction between Compilers and Computer Architectures. Los Alamitos, CA: IEEE Computer Society, 2002: 45–53
[132] Hoste K, Eeckhout L. Comparing benchmarks using key microarchitecture-independent characteristics[C]//Proc of the 2nd IEEE Int Symp on Workload Characterization. Los Alamitos, CA: IEEE Computer Society, 2006: 83–92
[133] Yi J J, Sendag R, Eeckhout L, et al. Evaluating benchmark subsetting approaches[C]//Proc of the 2nd IEEE Int Symp on Workload Characterization. Los Alamitos, CA: IEEE Computer Society, 2006: 93–104
[134] Conte T M, Hirsch M A, Menezes K N. Reducing state loss for effective trace sampling of superscalar processors[C]//Proc of the 14th Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 1996: 468–477
[135] Patil H, Cohn R, Charney M, et al. Pinpointing representative portions of large Intel® Itanium® programs with dynamic instrumentation[C]//Proc of the 37th Int Symp on Microarchitecture. Los Alamitos, CA: IEEE Computer Society, 2004: 81–92
[136] Nair A A, John L K. Simulation points for SPEC CPU 2006[C]//Proc of the 26th Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 2008: 397–403
[137] Lau J, Perelman E, Calder B. Selecting software phase markers with code structure analysis[C]//Proc of the 4th Int Symp on Code Generation and Optimization. Los Alamitos, CA: IEEE Computer Society, 2006: 135–146
[138] Lahiri K, Kunnoth S. Fast IPC estimation for performance projections using proxy suites and decision trees[C]//Proc of the 2017 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2017: 77–86
[139] Carlson T E, Heirman W, Eeckhout L. Sampled simulation of multi-threaded applications[C]//Proc of the 2013 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2013: 2–12
[140] Patil H, Pereira C, Stallcup M, et al. PinPlay: A framework for deterministic replay and reproducible analysis of parallel programs[C]//Proc of the 8th Annual IEEE/ACM Int Symp on Code Generation and Optimization. New York: ACM, 2010: 2–11
[141] Patil H, Isaev A, Heirman W, et al. ELFies: Executable region checkpoints for performance analysis and simulation[C]//Proc of the 19th IEEE/ACM Int Symp on Code Generation and Optimization. Piscataway, NJ: IEEE, 2021: 126–136
[142] Wenisch T F, Wunderlich R E, Falsafi B, et al. TurboSMARTS: Accurate microarchitecture simulation sampling in minutes[J]. ACM SIGMETRICS Performance Evaluation Review, 2005, 33(1): 408−409 doi: 10.1145/1071690.1064278
[143] Khan T M, Pérez D G, Temam O. Transparent sampling[C]//Proc of the 10th Int Conf on Embedded Computer Systems: Architectures, Modeling and Simulation. Piscataway, NJ: IEEE, 2010: 28–36
[144] Eeckhout L, Luo Yue, De Bosschere K, et al. BLRL: Accurate and efficient warmup for sampled processor simulation[J]. The Computer Journal, 2005, 48(4): 451−459 doi: 10.1093/comjnl/bxh103
[145] Haskins J W, Skadron K. Accelerated warmup for sampled microarchitecture simulation[J]. ACM Transactions on Architecture and Code Optimization, 2005, 2(1): 78−108 doi: 10.1145/1061267.1061272
[146] Van Ertvelde L, Hellebaut F, Eeckhout L. Accurate and efficient cache warmup for sampled processor simulation through NSL–BLRL[J]. The Computer Journal, 2008, 51(2): 192−206
[147] Jiang Chuntao, Yu Zhibin, Jin Hai, et al. Shorter on-line warmup for sampled simulation of multi-threaded applications[C]//Proc of the 44th Int Conf on Parallel Processing. Los Alamitos, CA: IEEE Computer Society, 2015: 350–359
[148] Bell R, Eeckhout L, John L, et al. Deconstructing and improving statistical simulation in HLS[C]//Proc of the 2004 Workshop on Duplicating, Deconstructing and Debunking held in Conjunction with the 31st Annual Int Symp on Computer Architecture. New York: ACM, 2004: 2−12
[149] Joshi A, Yi J J, Bell R H, et al. Evaluating the efficacy of statistical simulation for design space exploration[C]//Proc of the 2006 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2006: 70–79
[150] Eeckhout L, Bell R H, Stougie B, et al. Control flow modeling in statistical simulation for accurate and efficient processor design studies[C]//Proc of the 31st Annual Int Symp on Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2004: 350–361
[151] Bell R H, Bhatia R R, John L K, et al. Automatic testcase synthesis and performance model validation for high performance PowerPC processors[C]//Proc of the 2006 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2006: 154–165
[152] Lee H R, Sánchez D. Datamime: Generating representative benchmarks by automatically synthesizing datasets[C]//Proc of the 55th IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2022: 1144–1159
[153] Joshi A, Eeckhout L, Bell R H, et al. Performance cloning: A technique for disseminating proprietary applications as benchmarks[C]//Proc of the 2nd IEEE Int Symp on Workload Characterization. Los Alamitos, CA: IEEE Computer Society, 2006: 105–115
[154] Joshi A M, Eeckhout L, John L K, et al. Automated microprocessor stressmark generation[C]//Proc of the 14th Int Symp on High Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2008: 229–239
[155] Joshi A, Eeckhout L, Bell R H, et al. Distilling the essence of proprietary workloads into miniature benchmarks[J]. ACM Transactions on Architecture and Code Optimization, 2008, 5(2): 10: 1–10: 33
[156] Ganesan K, John L K. Automatic generation of miniaturized synthetic proxies for target applications to efficiently design multicore processors[J]. IEEE Transactions on Computers, 2014, 63(4): 833−846 doi: 10.1109/TC.2013.36
[157] Deniz E, Sen A, Kahne B, et al. MINIME: Pattern-aware multicore benchmark synthesizer[J]. IEEE Transactions on Computers, 2015, 64(8): 2239−2252 doi: 10.1109/TC.2014.2349522
[158] Lee K, Evans S, Cho S. Accurately approximating superscalar processor performance from traces[C]//Proc of the 2009 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2009: 238–248
[159] Lee K, Cho S. In-N-Out: Reproducing out-of-order superscalar processor behavior from reduced in-order traces[C]//Proc of the 19th Annual Int Symp on Modelling, Analysis, and Simulation of Computer and Telecommunication Systems. Los Alamitos, CA: IEEE Computer Society, 2011: 126–135
[160] Lee K, Cho S. Accurately modeling superscalar processor performance with reduced trace[J]. Journal of Parallel and Distributed Computing, 2013, 73(4): 509−521 doi: 10.1016/j.jpdc.2012.12.002
[161] Ganesan K, Jo J, John L K. Synthesizing memory-level parallelism aware miniature clones for SPEC CPU2006 and ImplantBench workloads[C]//Proc of the 2010 IEEE Int Symp on Performance Analysis of Systems & Software. Los Alamitos, CA: IEEE Computer Society, 2010: 33–44
[162] Panda R, Zheng Xinnian, John L K. Accurate address streams for LLC and beyond (SLAB): A methodology to enable system exploration[C]//Proc of the 2017 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2017: 87–96
[163] Van Biesbrouck M, Sherwood T, Calder B. A co-phase matrix to guide simultaneous multithreading simulation[C]//Proc of the 2004 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2004: 45–56
[164] Yi J J, Kodakara S V, Sendag R, et al. Characterizing and comparing prevailing simulation techniques[C]//Proc of the 11th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2005: 266–277
[165] Tairum Cruz M, Bischoff S, Rusitoru R. Shifting the barrier: Extending the boundaries of the BarrierPoint methodology[C]//Proc of the 2018 IEEE Int Symp on Performance Analysis of Systems and Software. Los Alamitos, CA: IEEE Computer Society, 2018: 120–122
[166] Bell R H, John L K. Efficient power analysis using synthetic testcases[C]//Proc of the 1st IEEE Int Symp on Workload Characterization. Piscataway, NJ: IEEE, 2005: 110–118
[167] Penry D A, Fay D, Hodgdon D, et al. Exploiting parallelism and structure to accelerate the simulation of chip multi-processors[C]//Proc of the 12th Int Symp on High-Performance Computer Architecture. Los Alamitos, CA: IEEE Computer Society, 2006: 29–40
[168] Mariani G, Palermo G, Zaccaria V, et al. DeSpErate: Speeding-up design space exploration by using predictive simulation scheduling[C/OL]//Proc of the 17th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2014[2023-12-18]. https://ieeexplore.ieee.org/document/6800432?arnumber=6800432
[169] Li Bin, Peng Lu, Ramadass B. Accurate and efficient processor performance prediction via regression tree based modeling[J]. Journal of Systems Architecture, 2009, 55(10): 457−467
[170] Pang Jiufeng, Li Xiafeng, Xie Jinsong, et al. Microarchitectural design space exploration via support vector machine[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2010, 46(1): 55−63
[171] Cook H, Skadron K. Predictive design space exploration using genetically programmed response surfaces[C]//Proc of the 45th Annual Design Automation Conf. New York: ACM, 2008: 960–965
[172] Zhai Jianwang, Bai Chen, Zhu Binwu, et al. McPAT-Calib: A microarchitecture power modeling framework for modern CPUs[C/OL]//Proc of the 40th IEEE/ACM Int Conf on Computer Aided Design. Piscataway, NJ: IEEE, 2021[2023-12-18]. https://ieeexplore.ieee.org/document/9643508
[173] Zhai Jianwang, Bai Chen, Zhu Binwu, et al. McPAT-Calib: A RISC-V BOOM microarchitecture power modeling framework[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(1): 243−256 doi: 10.1109/TCAD.2022.3169464
[174] Givargis T, Vahid F, Henkel J. System-level exploration for Pareto-optimal configurations in parameterized systems-on-a-chip[C]//Proc of the 20th IEEE/ACM Int Conf on Computer Aided Design. Los Alamitos, CA: IEEE Computer Society, 2001: 25–30
[175] Yazdani R, Sheidaeian H, Salehi M E. A fast design space exploration for VLIW architectures[C]//Proc of the 22nd Iranian Conf on Electrical Engineering. Piscataway, NJ: IEEE, 2014: 856–861
[176] Kansakar P, Munir A. A design space exploration methodology for parameter optimization in multicore processors[J]. IEEE Transactions on Parallel and Distributed Systems, 2018, 29(1): 2−15 doi: 10.1109/TPDS.2017.2745580
[177] Ascia G, Catania V, Di Nuovo A G, et al. Performance evaluation of efficient multi-objective evolutionary algorithms for design space exploration of embedded computer systems[J]. Applied Soft Computing, 2011, 11(1): 382−398 doi: 10.1016/j.asoc.2009.11.029
[178] Mariani G, Palermo G, Silvano C, et al. An efficient design space exploration methodology for multi-cluster VLIW architectures based on artificial neural networks[C]//Proc of the 16th IFIP/IEEE Int Conf on Very Large Scale Integration. Piscataway, NJ: IEEE, 2008: 13−15
[179] Zaccaria V, Palermo G, Castro F, et al. MULTICUBE Explorer: An open source framework for design space exploration of chip multi-processors[C]//Proc of the 23rd Int Conf on Architecture of Computing Systems. Hannover, Germany: VDE Verlag, 2010: 325–331
[180] Mariani G, Brankovic A, Palermo G, et al. A correlation-based design space exploration methodology for multi-processor systems-on-chip[C]//Proc of the 47th Design Automation Conf. New York: ACM, 2010: 120–125
[181] Wang Duo, Yan Mingyu, Liu Xin, et al. A high-accurate multi-objective exploration framework for design space of CPU[C/OL]//Proc of the 60th ACM/IEEE Design Automation Conf. Piscataway, NJ: IEEE, 2023[2023-12-18]. https://ieeexplore.ieee.org/document/10247790
[182] Wang Duo, Yan Mingyu, Teng Yihan, et al. A high-accurate multi-objective ensemble exploration framework for design space of CPU microarchitecture[C]//Proc of the 33rd Great Lakes Symp on VLSI 2023. New York: ACM, 2023: 379–383
[183] Beltrame G, Fossati L, Sciuto D. Decision-theoretic design space exploration of multiprocessor platforms[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2010, 29(7): 1083−1095
[184] Beltrame G, Nicolescu G. A multi-objective decision-theoretic exploration algorithm for platform-based design[C]//Proc of the 14th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2011: 1192−1195
[185] Sheldon D, Vahid F, Lonardi S. Soft-core processor customization using the design of experiments paradigm[C]//Proc of the 10th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2007: 821−826
[186] Mariani G, Palermo G, Silvano C, et al. Meta-model assisted optimization for design space exploration of multi-processor systems-on-chip[C]//Proc of the 12th Euromicro Conf on Digital System Design, Architectures, Methods and Tools. Los Alamitos, CA: IEEE Computer Society, 2009: 383–389
[187] Palermo G, Silvano C, Zaccaria V. Multi-objective design space exploration of embedded systems[J]. Journal of Embedded Computing, 2005, 1(3): 305−316
[188] Wu Nan, Xie Yuan, Hao Cong. IronMan: GNN-assisted design space exploration in high-level synthesis via reinforcement learning[C]//Proc of the 31st Great Lakes Symp on VLSI. New York: ACM, 2021: 39–44
[189] Wu Nan, Xie Yuan, Hao Cong. IronMan-Pro: Multiobjective design space exploration in HLS via reinforcement learning and graph neural network-based modeling[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2023, 42(3): 900−913 doi: 10.1109/TCAD.2022.3185540
[190] Kao S C, Jeong G, Krishna T. ConfuciuX: Autonomous hardware resource assignment for DNN accelerators using reinforcement learning[C]//Proc of the 53rd Annual IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2020: 622–636
[191] Feng Lang, Liu Wenjian, Guo Chuliang, et al. GANDSE: Generative adversarial network based design space exploration for neural network accelerator design[J]. ACM Transactions on Design Automation of Electronic Systems, 2023, 28(3): 35: 1−35: 20
[192] Akram A, Sawalha L. A survey of computer architecture simulation techniques and tools[J]. IEEE Access, 2019, 7: 78120−78145 doi: 10.1109/ACCESS.2019.2917698
[193] Manjikian N. Multiprocessor enhancements of the SimpleScalar tool set[J]. SIGARCH Computer Architecture News, 2001, 29(1): 8−15 doi: 10.1145/373574.373578
[194] Qureshi Y M, Simon W A, Zapater M, et al. Gem5-X: A many-core heterogeneous simulation platform for architectural exploration and optimization[J]. ACM Transactions on Architecture and Code Optimization, 2021, 18(4): 44: 1–44: 27
[195] Carlson T E, Heirman W, Eyerman S, et al. An evaluation of high-level mechanistic core models[J]. ACM Transactions on Architecture and Code Optimization, 2014, 11(3): 28: 1–28: 25
[196] Tan Zhangxi, Waterman A, Cook H, et al. A case for FAME: FPGA architecture model execution[C]//Proc of the 37th Annual Int Symp on Computer Architecture. New York: ACM, 2010: 290–301
[197] Lee Y, Waterman A, Cook H, et al. An agile approach to building RISC-V microprocessors[J]. IEEE Micro, 2016, 36(2): 8−20 doi: 10.1109/MM.2016.11
[198] Di Biagio A, Davis M. llvm-mca: A static performance analysis tool[EB/OL]. (2018-03-01)[2023-12-01]. https://lists.llvm.org/pipermail/llvm-dev/2018-March/121490.html
[199] Mendis C, Renda A, Amarasinghe D S, et al. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks[C]//Proc of the 36th Int Conf on Machine Learning. New York: PMLR, 2019: 4505–4515
[200] Blocklove J, Garg S, Karri R, et al. Chip-Chat: Challenges and opportunities in conversational hardware design[C/OL]//Proc of the 5th ACM/IEEE Workshop on Machine Learning for CAD. Piscataway, NJ: IEEE, 2023[2023-12-18]. https://ieeexplore.ieee.org/document/10299874
[201] Chang Kaiyan, Wang Ying, Ren Haimeng, et al. ChipGPT: How far are we from natural language hardware design[J]. arXiv preprint, arXiv: 2305.14019, 2023
[202] Lu Yao, Liu Shang, Zhang Qijun, et al. RTLLM: An open-source benchmark for design RTL generation with large language model[J]. arXiv preprint, arXiv: 2308.05345, 2023
[203] Balkind J, Chang Tingjung, Jackson P J, et al. OpenPiton at 5: A nexus for open and agile hardware design[J]. IEEE Micro, 2020, 40(4): 22−31 doi: 10.1109/MM.2020.2997706
[204] Bachrach J, Vo H, Richards B, et al. Chisel: Constructing hardware in a Scala embedded language[C]//Proc of the 49th Annual Design Automation Conf. New York: ACM, 2012: 1216–1225
[205] Patel H D, Shukla S K. Tackling an abstraction gap: Co-simulating SystemC DE with Bluespec ESL[C]//Proc of the 10th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2007: 279−284
[206] Bourgeat T, Pit-Claudel C, Chlipala A, et al. The essence of Bluespec: A core language for rule-based hardware design[C]//Proc of the 41st ACM SIGPLAN Conf on Programming Language Design and Implementation. New York: ACM, 2020: 243–257
[207] Käyrä M, Hämäläinen T D. A survey on system-on-a-chip design using Chisel HW construction language[C/OL]//Proc of the 47th Annual Conf of the IEEE Industrial Electronics Society. Piscataway, NJ: IEEE, 2021[2023-12-18]. https://ieeexplore.ieee.org/document/9589614
[208] 王凯帆,徐易难,余子濠,等. 香山开源高性能RISC-V处理器设计与实现[J]. 计算机研究与发展,2023,60(3):476−493 doi: 10.7544/issn1000-1239.202221036
Wang Kaifan, Xu Yinan, Yu Zihao, et al. XiangShan open-source high performance RISC-V processor design and implementation[J]. Journal of Computer Research and Development, 2023, 60(3): 476−493 (in Chinese) doi: 10.7544/issn1000-1239.202221036
[209] Lee B C, Brooks D M, de Supinski B R, et al. Methods of inference and learning for performance modeling of parallel applications[C]//Proc of the 12th ACM SIGPLAN Symp on Principles and Practice of Parallel Programming. New York: ACM, 2007: 249–258
[210] Hallschmid P, Saleh R. Fast design space exploration using local regression modeling with application to ASIPs[J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2008, 27(3): 508−515 doi: 10.1109/TCAD.2008.915532
[211] Zhang Changshu, Ravindran A, Datta K, et al. A machine learning approach to modeling power and performance of chip multiprocessors[C]//Proc of the 29th Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 2011: 45–50
[212] Beg A, Prasad P W C, Singh A K, et al. A neural model for processor-throughput using hardware parameters and software’s dynamic behavior[C]//Proc of the 12th Int Conf on Intelligent Systems Design and Applications. Piscataway, NJ: IEEE, 2012: 821–825
[213] Paone E, Vahabi N, Zaccaria V, et al. Improving simulation speed and accuracy for many-core embedded platforms with ensemble models[C]//Proc of the 16th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2013: 671–676
[214] Castillo P A, Mora A M, Guervós J J M, et al. Architecture performance prediction using evolutionary artificial neural networks[C]//Proc of the Applications of Evolutionary Computing. Berlin: Springer, 2008: 175–183
[215] Khan S, Xekalakis P, Cavazos J, et al. Using predictive modeling for cross-program design space exploration in multicore systems[C]//Proc of the 16th Int Conf on Parallel Architecture and Compilation Techniques. Los Alamitos, CA: IEEE Computer Society, 2007: 327–338
[216] Dubach C, Jones T M, O’Boyle M F P. Rapid early-stage microarchitecture design using predictive models[C]//Proc of the 27th Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 2009: 297–304
[217] Özisikyilmaz B, Memik G, Choudhary A N. Machine learning models to predict performance of computer system design alternatives[C]//Proc of the 37th Int Conf on Parallel Processing. Los Alamitos, CA: IEEE Computer Society, 2008: 495–502
[218] Özisikyilmaz B, Memik G, Choudhary A N. Efficient system design space exploration using machine learning techniques[C]//Proc of the 45th Design Automation Conf. New York: ACM, 2008: 966–969
[219] Ghosh A, Givargis T. Cache optimization for embedded processor cores: An analytical approach[J]. ACM Transactions on Design Automation of Electronic Systems, 2004, 9(4): 419−440 doi: 10.1145/1027084.1027086
[220] Li Sheng, Chen Ke, Ahn J H, et al. CACTI-P: Architecture-level modeling for SRAM-based structures with advanced leakage reduction techniques[C]//Proc of the 30th Int Conf on Computer-Aided Design. Los Alamitos, CA: IEEE Computer Society, 2011: 694–701
[221] Li Sheng, Ahn J H, Strong R D, et al. McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures[C]//Proc of the 42nd Annual IEEE/ACM Int Symp on Microarchitecture. New York: ACM, 2009: 469–480
[222] Karkhanis T S, Smith J E. A first-order superscalar processor model[C]//Proc of the 31st Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2004: 338–349
[223] Genbrugge D, Eyerman S, Eeckhout L. Interval simulation: Raising the level of abstraction in architectural simulation[C/OL]//Proc of the 16th Int Symp on High-Performance Computer Architecture. Piscataway, NJ: IEEE, 2010[2023-12-18]. https://ieeexplore.ieee.org/document/5416636
[224] Breughe M, Eyerman S, Eeckhout L. A mechanistic performance model for superscalar in-order processors[C]//Proc of the 2012 IEEE Int Symp on Performance Analysis of Systems & Software. Los Alamitos, CA: IEEE Computer Society, 2012: 14–24
[225] Van den Steen S, Eyerman S, De Pestel S, et al. Analytical processor performance and power modeling using micro-architecture independent characteristics[J]. IEEE Transactions on Computers, 2016, 65(12): 3537−3551
[226] De Pestel S, Van den Steen S, Akram S, et al. RPPM: Rapid performance prediction of multithreaded workloads on multicore processors[C]//Proc of the 2019 IEEE Int Symp on Performance Analysis of Systems and Software. Piscataway, NJ: IEEE, 2019: 257–267
[227] Jongerius R, Mariani G, Anghel A, et al. Analytic processor model for fast design-space exploration[C]//Proc of the 33rd IEEE Int Conf on Computer Design. Los Alamitos, CA: IEEE Computer Society, 2015: 411–414
[228] Jongerius R, Anghel A, Dittmann G, et al. Analytic multi-core processor model for fast design-space exploration[J]. IEEE Transactions on Computers, 2018, 67(6): 755−770 doi: 10.1109/TC.2017.2780239
[229] Kwon J, Carloni L P. Transfer learning for design-space exploration with high-level synthesis[C]//Proc of the 2nd ACM/IEEE Workshop on Machine Learning for CAD. New York: ACM, 2020: 163–168
[230] Zhang Zheng, Chen Tinghuan, Huang Jiaxin, et al. A fast parameter tuning framework via transfer learning and multi-objective Bayesian optimization[C]//Proc of the 59th ACM/IEEE Design Automation Conf. New York: ACM, 2022: 133–138
[231] Zhang Keyi, Asgar Z, Horowitz M. Bringing source-level debugging frameworks to hardware generators[C]//Proc of the 59th ACM/IEEE Design Automation Conf. New York: ACM, 2022: 1171–1176
[232] Xiao Qingcheng, Zheng Size, Wu Bingzhe, et al. HASCO: Towards agile hardware and software co-design for tensor computation[C]//Proc of the 48th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2021: 1055–1068
[233] Esmaeilzadeh H, Ghodrati S, Kahng A B, et al. Physically accurate learning-based performance prediction of hardware-accelerated ML algorithms[C]//Proc of the 4th ACM/IEEE Workshop on Machine Learning for CAD. New York: ACM, 2022: 119–126
[234] Sun Qi, Chen Tinghuan, Liu Siting, et al. Correlated multi-objective multi-fidelity optimization for HLS directives design[C]//Proc of the 24th Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2021: 46–51
[235] Wu Y N, Tsai P A, Parashar A, et al. Sparseloop: An analytical approach to sparse tensor accelerator modeling[C]//Proc of the 55th IEEE/ACM Int Symp on Microarchitecture. Piscataway, NJ: IEEE, 2022: 1377–1395
[236] Huang Qijing, Kang M, Dinh G, et al. CoSA: Scheduling by constrained optimization for spatial accelerators[C]//Proc of the 48th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2021: 554–566
[237] Mei Linyan, Houshmand P, Jain V, et al. ZigZag: Enlarging joint architecture-mapping design space exploration for DNN accelerators[J]. IEEE Transactions on Computers, 2021, 70(8): 1160−1174 doi: 10.1109/TC.2021.3059962