ISSN 1000-1239 CN 11-1777/TP


An Optimization of Broadcast on Godson-T Many-Core System Architecture

Bao Ergude1,2, Li Weisheng1, Fan Dongrui2, Yang Yang2,3, and Ma Xiaoyu2

  1(School of Software, Beijing Jiaotong University, Beijing 100044) 2(Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190) 3(School of Computer Science and Information Technology, Beijing Jiaotong University, Beijing 100044) (baoergude@gmail.com)
  • Publication date: 2010-03-15
  • Online: 2010-03-15

Abstract: Godson-T is a large-scale many-core on-chip system architecture, designed for implementation in ultra-deep-submicron MOS technology, under development by the Advanced Microsystem Group of the Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences. The single-port design of Godson-T's on-chip memory saves chip area but limits the efficiency of reading shared data. Broadcast is a basic parallel algorithm used to accelerate data sharing, but implementing the traditional algorithm directly on Godson-T incurs heavy synchronization and mutual-exclusion overhead and therefore yields little performance gain. Based on the Godson-T architecture, the authors optimize Broadcast and improve the efficiency of concurrent reads. Three techniques are used: eliminating bulk synchronization among threads, establishing a mapping table from source addresses to destination addresses, and implementing the Broadcast kernel in assembly language with rearranged instructions. The first reduces the cost of synchronizing a large number of threads, the second speeds up destination-address lookup, and the third takes full advantage of the Godson-T architecture. The optimized Broadcast performs well on Godson-T: with 32 cores, its speedup reaches 5.8.

Key words: Godson-T, many-core, Broadcast, synchronization, mutual exclusion, concurrent read, mapping table, speedup
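
To make the first two techniques concrete, the following is a minimal Python sketch, not the authors' Godson-T implementation (which the abstract says is written in assembly). It assumes a binomial-tree broadcast rooted at core 0, models each core's local memory as a `bytearray`, and uses one `threading.Event` per core in place of a global barrier. The names `children`, `broadcast`, and the layout of `table` (source offset mapped to one destination offset per core) are hypothetical and chosen only for illustration.

```python
import threading


def children(core, num_cores):
    """Children of `core` in a binomial broadcast tree rooted at core 0."""
    kids, step = [], 1
    while step <= core:          # skip rounds before this core holds the data
        step <<= 1
    while core + step < num_cores:
        kids.append(core + step)
        step <<= 1
    return kids


def broadcast(memories, table, src_offset, length):
    """Copy `length` bytes from core 0 to every core's local memory.

    table[src_offset][c] is the precomputed destination offset in core c's
    memory (assumed layout); table[src_offset][0] must equal src_offset.
    """
    n = len(memories)
    dests = table[src_offset]    # one lookup, no per-copy address search
    ready = [threading.Event() for _ in range(n)]

    def worker(core):
        # Point-to-point wait: block only until this core's data has arrived,
        # instead of synchronizing all threads at a barrier each round.
        ready[core].wait()
        start = dests[core]
        payload = memories[core][start:start + length]
        for kid in children(core, n):
            dst = dests[kid]
            memories[kid][dst:dst + length] = payload
            ready[kid].set()     # signal only the core that just received data

    threads = [threading.Thread(target=worker, args=(c,)) for c in range(n)]
    for t in threads:
        t.start()
    ready[0].set()               # the root already holds the data
    for t in threads:
        t.join()
```

For example, with eight 16-byte memories and `table = {0: [0, 4, 8, 2, 6, 1, 3, 5]}`, calling `broadcast(memories, table, 0, 4)` places the four bytes stored at offset 0 of core 0 at the listed offset in each other core. The sketch shows only the control structure; it says nothing about the speedup reported in the abstract.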