-
Abstract: In high-performance processor development, accurate and fast performance estimation is the basis for design decisions and parameter selection. Prior work accelerates processor RTL emulation through workload sampling and architectural checkpoints for RTL, which makes it possible to estimate the performance of benchmarks such as SPEC CPU on complex high-performance processors within a few days. However, an iteration cycle of several days is still too long for architecture exploration, and the performance measurement cycle can be shortened further. The warm-up phase accounts for a large share of the time spent in processor RTL emulation, and the HyWarm framework is proposed to accelerate it. HyWarm analyzes the warm-up demand of each workload with a microarchitectural simulator and customizes a warm-up scheme for each workload. For workloads with high cache warm-up demand, HyWarm functionally warms up the RTL caches through their bus protocol. For the full-detail RTL emulation, HyWarm uses CPU clustering and LJF scheduling to reduce the maximum completion time. Compared with the best existing sampling-based RTL emulation method, HyWarm shortens the emulation completion time by 53% while keeping accuracy similar to that of the baseline method.
-
Table 1 Emulation Speed of Verilator with Different Parallelism on AMD EPYC 7H12 Server with 64 Cores

| Emulation speed/IPS | 4 threads, 1 task | 4 threads, 16 tasks | Performance loss at full load |
| --- | --- | --- | --- |
| Per task | 2153.13 | 1189.31 | |
| Per core | 538.28 | 297.33 | 45% |
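Table 2 Comparison of Commonly Used RTL Performance Evaluation Methods

| RTL performance evaluation method | Emulation frequency | Typical price/CNY | Rentable | Typical design capacity |
| --- | --- | --- | --- | --- |
| RTL software simulator | ≤ 1 kHz | 50,000–100,000 | Yes | Commercial-grade SoCs |
| Public-cloud FPGA | ≤ 100 MHz | 240–3600 per day | Yes | BOOM processor |
| Private FPGA | ≤ 100 MHz | ≤ 400,000 | No | XiangShan processor |
| Hardware emulation accelerator | ≤ 1 MHz | > 10,000,000 | No | Commercial-grade SoCs |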
Table 3 Comparison of Multi-threading Scaling Efficiency of Verilator Emulation When Server Load is Low

| Number of threads | 1 | 4 | 8 | 16 |
| --- | --- | --- | --- | --- |
| Per-core IPS | 190.82 | 538.28 | 450.94 | 321.27 |
Table 4 Comparison of Multi-threading Scaling Efficiency of Verilator Emulation When Server is Fully Loaded

| Number of threads | 4 | 8 | 16 |
| --- | --- | --- | --- |
| Per-core IPS | 297.33 | 389.27 | 335.50 |
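The 45% full-load loss reported in Table 1 appears to follow from the 4-thread per-core IPS values that Tables 3 and 4 repeat. The sketch below reproduces that arithmetic, assuming the loss is defined as one minus the ratio of full-load to low-load per-core IPS; the tables do not state the definition, and the dictionary and function names are illustrative only.

```python
# Per-core IPS copied from Table 3 (low load) and Table 4 (full load),
# keyed by the number of Verilator threads per task.
low_load = {1: 190.82, 4: 538.28, 8: 450.94, 16: 321.27}
full_load = {4: 297.33, 8: 389.27, 16: 335.50}


def full_load_loss(threads):
    """Fractional per-core slowdown from low load to full load.

    Assumed definition: 1 - (full-load IPS / low-load IPS).
    A negative value means the full-load figure is the higher one.
    """
    return 1.0 - full_load[threads] / low_load[threads]


if __name__ == "__main__":
    for threads in sorted(full_load):
        print(f"{threads:2d} threads: {full_load_loss(threads):+.1%}")
```

Under this assumed definition, the 4-thread configuration loses far more per-core throughput at full load than the 8-thread configuration does.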
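Table 5 Microarchitectural Configuration

| Component | Configuration |
| --- | --- |
| Branch predictor | 16KB TAGE-SC + ITTAGE + RAS + 4KB BTB |
| L1 data cache | 128KB, 8-way |
| L1 instruction cache | 128KB, 8-way |
| L2 cache | 1MB, 8-way, non-inclusive |
| L3 cache | 6MB, 6-way, non-inclusive |
| L1 instruction TLB | 40 entries |
| L1 data TLB | 136 entries (128 × 4K pages + 8 × 2M pages) |
| L2 TLB | 2K entries |
| Fetch width | 8 × 4B instructions per cycle |
| Decode/rename width | 6 instructions per cycle |
| ROB/LQ/SQ | 256/80/64 |
| Physical register file | 192 integer; 192 floating-point |
| Execution units | Int: 4×ALU, 2×MDU, 1×Misc; Mem: 2×Ld AGU, 2×St AGU; Float: 4×FMA, 2×Misc |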
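Table 6 Warm-up Configurations

| Scheme | Functional warm-up (M instructions) | Full-detail warm-up (M instructions) | Performance measurement (M instructions) |
| --- | --- | --- | --- |
| 0+100 | 0 | 100 | 5 |
| 0+50 | 0 | 50 | 5 |
| 0+25 | 0 | 20 | 5 |
| 0+10 | 0 | 10 | 5 |
| 0+5 | 0 | 5 | 5 |
| Ada | 100−DW | Adaptive (DW) | 5 |
| FixedFW (95+5) | 95 | 5 | 5 |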
Table 7 Comparison of Total Simulation Time for Different Functional Warm-up Schemes (h)

| Item | 0+5 | 0+10 | 0+25 | FixedFW (95+5) | Ada |
| --- | --- | --- | --- | --- | --- |
| GemsFDTD | 0.37 | 0.55 | 1.04 | 0.42 | 0.29 |
| astar.bi | 0.57 | 0.91 | 1.65 | 0.58 | 0.64 |
| astar.ri | 0.69 | 0.95 | 1.97 | 0.66 | 0.79 |
| bwaves | 0.57 | 0.92 | 1.68 | 0.60 | 0.43 |
| bzip2.chi | 0.30 | 0.43 | 0.81 | 0.30 | 0.22 |
| bzip2.com | 1.00 | 1.52 | 2.71 | 1.01 | 0.72 |
| bzip2.htm | 0.30 | 0.43 | 0.92 | 0.34 | 0.31 |
| bzip2.lib | 0.30 | 0.42 | 0.89 | 0.30 | 0.21 |
| bzip2.pro | 1.01 | 1.60 | 3.19 | 0.98 | 0.68 |
| bzip2.sou | 0.95 | 1.49 | 2.92 | 1.08 | 0.96 |
| cactusADM | 0.41 | 0.60 | 1.35 | 0.47 | 0.32 |
| calculix | 0.35 | 0.60 | 1.12 | 0.36 | 0.26 |
| dealII | 0.33 | 0.51 | 1.10 | 0.40 | 1.20 |
| gamess.cy | 0.33 | 0.49 | 1.00 | 0.36 | 3.46 |
| gamess.gra | 0.35 | 0.51 | 1.06 | 0.38 | 1.09 |
| gamess.tri | 0.33 | 0.50 | 0.92 | 0.34 | 1.10 |
| gcc.166 | 0.42 | 0.61 | 1.33 | 0.48 | 1.34 |
| gcc.200 | 0.90 | 1.17 | 2.72 | 0.89 | 0.71 |
| gcc.cpde | 0.54 | 0.86 | 1.63 | 0.62 | 1.75 |
| gcc.expr2 | 0.58 | 0.86 | 1.76 | 0.63 | 1.03 |
| gcc.expr | 0.63 | 0.89 | 1.75 | 0.61 | 0.70 |
| gcc.g23 | 0.55 | 0.76 | 1.54 | 0.66 | 0.43 |
| gcc.s04 | 0.57 | 0.93 | 1.66 | 0.67 | 0.69 |
| gcc.scil | 0.90 | 1.10 | 2.34 | 0.94 | 2.48 |
| gcc.type | 0.92 | 1.44 | 2.62 | 0.91 | 1.57 |
| gobmk.13x | 0.94 | 1.51 | 3.08 | 0.99 | 1.66 |
| gobmk.nn | 0.85 | 1.28 | 2.61 | 0.92 | 0.61 |
| gobmk.sco | 0.97 | 1.34 | 2.70 | 0.98 | 0.66 |
| gobmk.tr | 0.95 | 1.30 | 2.63 | 0.87 | 0.98 |
| gobmk.tr | 0.71 | 1.07 | 2.26 | 0.73 | 1.17 |
| gromacs | 0.72 | 1.00 | 2.25 | 0.72 | 0.48 |
| h264ref.f | 0.44 | 0.58 | 1.21 | 0.47 | 0.45 |
| h264ref.s | 0.38 | 0.50 | 1.04 | 0.38 | 2.23 |
| hmmer.nph | 0.77 | 1.25 | 2.52 | 0.85 | 1.45 |
| hmmer.re | 0.80 | 1.21 | 2.43 | 0.92 | 0.79 |
| lbm | 0.67 | 1.02 | 2.08 | 0.74 | 0.57 |
| leslie3d | 0.51 | 0.78 | 1.43 | 0.51 | 0.35 |
| libquantum | 0.56 | 0.78 | 1.55 | 0.98 | 0.39 |
| mcf | 3.14 | 4.18 | 9.35 | 3.34 | 2.32 |
| milc | 0.42 | 0.59 | 1.26 | 0.46 | 0.34 |
| namd | 0.52 | 0.77 | 1.38 | 0.48 | 0.31 |
| omnetpp | 1.08 | 1.66 | 3.19 | 1.27 | 1.06 |
| perl.che | 0.46 | 0.68 | 1.29 | 0.47 | 0.83 |
| perl.di | 0.55 | 0.83 | 1.37 | 0.52 | 1.56 |
| perl.spli | 0.43 | 0.66 | 1.31 | 0.43 | 0.32 |
| povray | 0.55 | 0.88 | 1.65 | 0.54 | 5.39 |
| sjeng | 0.72 | 1.05 | 2.00 | 0.67 | 2.14 |
| soplex.p | 1.15 | 1.59 | 3.57 | 1.36 | 0.87 |
| soplex.r | 1.11 | 1.70 | 3.05 | 1.14 | 0.71 |
| sphinx3 | 0.46 | 0.72 | 1.33 | 0.59 | 1.49 |
| tonto | 0.37 | 0.55 | 1.19 | 0.41 | 0.48 |
| xalancbmk | 0.89 | 1.42 | 2.56 | 1.17 | 1.03 |
| zeusmp | 0.51 | 0.75 | 1.53 | 0.58 | 0.39 |
| Total | 35.8 | 52.7 | 105.5 | 38.5 | 54.4 |

Note: mcf (9.35 h) is the longest-running item with 25M instructions of full-detail warm-up, whereas povray (5.39 h) is the longest-running item under the Ada configuration.
Table 8 Accuracy Comparison of Different Schemes (%)

| Scheme | CPI | Branch MPKI | L1MP |
| --- | --- | --- | --- |
| Ada | 99.6 | 91.6 | 95.1 |
| 0+50 | 99.8 | 98.9 | 97.5 |
| 0+25 | 99.7 | 94.1 | 91.3 |
| 0+10 | 99.1 | 85.2 | 82.8 |
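Table 9 Branch MPKI Prediction Error Caused by WarmProfiler (Increase)

| Item | MPKI with perfect prediction | MPKI increase | MPKI increase/% |
| --- | --- | --- | --- |
| gcc_expr2 | 0.443 | 0.177 | 39.9 |
| gcc_g23 | 0.973 | 0.172 | 17.7 |
| tonto | 0.506 | 0.117 | 23.1 |
| gamess_g | 0.430 | 0.112 | 26.1 |
| gcc_scilab | 7.687 | 0.090 | 1.2 |
| xalancbmk | 2.003 | 0.079 | 3.9 |
| gcc_s04 | 0.163 | 0.070 | 42.8 |
| perl_di | 0.669 | 0.066 | 9.8 |
| h264ref_f | 0.042 | 0.064 | 151.9 |
| astar_rivers | 3.422 | 0.053 | 1.6 |

Note: The MPKI error is the MPKI obtained with WarmProfiler-guided warm-up minus the MPKI obtained when warming up according to the real warm-up demand of the RTL. The first four items (gcc_expr2, gcc_g23, tonto, gamess_g) have an MPKI error above 0.1.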
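Table 10 Impact of Cluster Count on Scheduling Balance

| Scheduling balance | Random scheduling | LJF scheduling |
| --- | --- | --- |
| 4 clusters × 16 cores | 0.93 | 0.99 |
| 8 clusters × 8 cores | 0.76 | 0.98 |
| 16 clusters × 4 cores | 0.54 | 0.63 |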
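Table 11 Comparison of Simulation Time Between LJF Scheduling and Random Scheduling

| Simulation | Random scheduling/h | LJF scheduling/h | Improvement/% |
| --- | --- | --- | --- |
| Ada, 8 cores × 8 clusters | 8.71 | 6.91 | 20.61 |
| Ada, 8 cores × 16 clusters | 6.25 | 5.38 | 13.89 |
| 25+5, 8 cores × 8 clusters | 15.98 | 13.54 | 15.26 |
| 25+5, 8 cores × 16 clusters | 11.29 | 9.35 | 17.23 |

Note: Ada combined with LJF scheduling is the scheme proposed by HyWarm; 25+5 combined with random scheduling is the baseline scheme.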
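Table 12 Maximum Completion Time of LJF Scheduling Guided by Simulator IPC and Real IPC of RTL (h)

| Ada simulation | Simulator-predicted IPC | Real IPC |
| --- | --- | --- |
| 8 cores × 4 clusters | 13.77 | 13.67 |
| 8 cores × 8 clusters | 6.91 | 6.92 |
| 8 cores × 16 clusters | 5.38 | 5.38 |

Note: With 8 clusters, the simulator-predicted IPC yields a shorter completion time (6.91 h versus 6.92 h). This can happen because LJF is a greedy algorithm, so errors in the predicted completion times may lead to a better schedule.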
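The greedy behavior behind these LJF results can be illustrated with a small sketch. It reads LJF as longest-job-first list scheduling: jobs are sorted by estimated duration and each is placed on the currently least-loaded cluster. That reading, the function name `ljf_schedule`, and the use of a handful of Ada times from Table 7 as job durations are assumptions made for illustration; this is not HyWarm's actual scheduler, which estimates durations from simulator-predicted IPC.

```python
import heapq


def ljf_schedule(durations, num_clusters):
    """Greedy longest-job-first assignment of jobs to clusters.

    durations: mapping of job name -> estimated run time in hours.
    Returns (makespan, assignment), where assignment maps each
    cluster id to the list of job names placed on it.
    """
    # Min-heap of (accumulated load, cluster id); the root is the
    # cluster that currently has the least work queued.
    heap = [(0.0, cluster) for cluster in range(num_clusters)]
    heapq.heapify(heap)
    assignment = {cluster: [] for cluster in range(num_clusters)}

    # Visit jobs from longest to shortest estimated duration.
    for name, hours in sorted(durations.items(), key=lambda kv: kv[1], reverse=True):
        load, cluster = heapq.heappop(heap)
        assignment[cluster].append(name)
        heapq.heappush(heap, (load + hours, cluster))

    makespan = max(load for load, _ in heap)
    return makespan, assignment


if __name__ == "__main__":
    # Illustrative durations: a few Ada-column values from Table 7 (hours).
    jobs = {"povray": 5.39, "gamess.cy": 3.46, "gcc.scil": 2.48,
            "mcf": 2.32, "h264ref.s": 2.23, "sjeng": 2.14}
    makespan, plan = ljf_schedule(jobs, num_clusters=2)
    print(f"makespan = {makespan:.2f} h")
    for cluster, names in plan.items():
        print(cluster, names)
```

Because each placement depends only on the current loads, a small change in one estimated duration can reorder the jobs and change the final makespan in either direction, which is consistent with the note under Table 12.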