高级检索
    唐 亮 骆祖莹 赵国兴 杨 旭. 利于GPU计算具有线性并行度的P/G网SOR求解算法[J]. 计算机研究与发展, 2013, 50(7): 1491-1500.
    引用本文: 唐 亮 骆祖莹 赵国兴 杨 旭. 利于GPU计算具有线性并行度的P/G网SOR求解算法[J]. 计算机研究与发展, 2013, 50(7): 1491-1500.
    Tang Liang, Luo Zuying, Zhao Guoxing, and Yang Xu. SOR-Based P/G Solving Algorithm of Linear Parallelism for GPU Computing[J]. Journal of Computer Research and Development, 2013, 50(7): 1491-1500.
    Citation: Tang Liang, Luo Zuying, Zhao Guoxing, and Yang Xu. SOR-Based P/G Solving Algorithm of Linear Parallelism for GPU Computing[J]. Journal of Computer Research and Development, 2013, 50(7): 1491-1500.

    利于GPU计算具有线性并行度的P/G网SOR求解算法

    SOR-Based P/G Solving Algorithm of Linear Parallelism for GPU Computing

    • 摘要: 近年来电子设计自动化(EDA)研究人员尝试利用图形处理器(graphic processing unit, GPU)提供的高性能计算能力对IC参数分析进行加速研究.为了利用GPU进行电源线/地线网络(power/ground network, P/G网)快速分析,设计了一种基于经典的连续过松弛(successive over-relaxation, SOR)算法的高效P/G网分析并行算法.基于GPU并行计算加速原理,此算法进行了如下改进:1)采用红-黑次序的松弛策略.将所有的节点分为红黑两类,红色节点的所有邻点只有黑色节点、黑色节点的所有邻点只有红色节点,红色节点与黑色节点交替松弛,保证了GPU并行计算中的数据一致性.对于具有N个节点的P/G网而言,一次红色节点或黑色节点松弛可以同时对N/2个节点进行松弛操作,即理论上可以同时启动N/2个并行线程.2)优化数据结构.实现了对数据空间的合并访问,以保证对GPU全局存储空间的最优访问.3)在共享存储器内通过并行归约对松弛标记进行快速统计,同时利用zero-copy技术进行松弛标记的快速拷贝,以快速决定是否继续松弛.大量的实验结果表明:与单线程的CPU程序相比,此算法的加速倍数随GPU所提供物理线程的数目增加而线性增加,可以获得最大242倍的加速效果,是目前EDA研究领域中加速效果最好的GPU算法.

       

      Abstract: Recently some EDA researchers try using the high computing performance of the graphic processing unit (GPU) to speed up IC parameter analysis. In order to apply GPU for efficient analyses of power/ground(P/G) networks, a novel efficient parallel analysis algorithm is proposed based on the traditional successive over-relaxation (SOR) algorithm. According to the accelerating principles of GPU parallel computing, the algorithm has made the following improvements. 1) Use odd/even based red/black strategy to classify all nodes so that all neighbors of a red node are black ones and a black node has all red neighbors. Red nodes and black nodes are relaxed alternatively for data consistency of GPU computing. As for a P/G network of N nodes, one relaxation process of red or black nodes will relax N/2 nodes, which means that the red/black SOR will implement N/2 threads at one time from the viewpoint of theory. 2) Optimize the data structure to implement data coalesced access. 3) Local shared memory is used to parallel reduce relaxation ending flags and the compacted flags are zero copied to the main memory of the computer system at high speed. In turn, CPU can decide fast whether to activate the next round of relaxation kernels or end the relaxation. A large number of experiments demonstrate that compared with the single CPU thread, the speedup times of our algorithm increase linearly as GPU physical threads increase, and the algorithm can provide 242X speedup at maximum. Thus, according to our best knowledge, the algorithm provides the best GPU accelerating efficiency among all present EDA researches.

       

    /

    返回文章
    返回