Pan Decai, Mou Di, Shang Jiaxing, Liu Dajiang. Memory Partitioning Optimization of CGRA Using Access Pattern Morphing[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202440079

Memory Partitioning Optimization of CGRA Using Access Pattern Morphing

Funds: This work was supported by the National Natural Science Foundation of China (62274019) and the Natural Science Foundation of Chongqing City (CSTB2022NSCQ-MSX1017).
  • Author Bio:

    Pan Decai: born in 1999. Master's degree. His main research interests include reconfigurable computing and compiler optimization.

    Mou Di: born in 1999. Master's candidate. His main research interests include high-performance hardware accelerators for deep neural networks.

    Shang Jiaxing: born in 1987. PhD, professor, PhD supervisor. Member of CCF and IEEE. His main research interests include social network analysis and industry big data mining.

    Liu Dajiang: born in 1986. PhD, associate professor, PhD supervisor. Member of CCF and IEEE. His main research interests include reconfigurable computing, domain-specific accelerators, and compiler optimization.

  • Received Date: February 01, 2024
  • Revised Date: August 11, 2024
  • Accepted Date: September 13, 2024
  • Available Online: December 11, 2024
  • Abstract: With run-time configurable hardware, the coarse-grained reconfigurable array (CGRA) is a promising platform that offers both programming flexibility and energy efficiency for data-intensive applications. To exploit the access parallelism of multi-bank memory, memory partitioning is commonly applied to CGRAs. However, existing memory partitioning work on CGRAs either reaches the optimal partitioning solution at the cost of expensive addressing overhead, or obtains area- and energy-efficient hardware at the cost of consuming more banks. To address this, we propose an efficient memory partitioning approach for loop pipelining on CGRAs via access pattern morphing. By co-optimizing memory partitioning and scheduling on multi-dimensional arrays, a partition-friendly access pattern is formed in the data domain so that it can be partitioned with a minimal number of all-one partitioning hyperplanes, which both optimizes the partition factor and reduces addressing overhead. To solve the partitioning problem, we first propose a backtracking-based scheduling algorithm that finds a partition-friendly pattern with a minimized initiation interval. Then, based on the partitioning result, we propose an energy- and area-efficient CGRA architecture that simplifies the address generators in the load-store units. Experimental results show that our approach achieves 1.25 times the energy efficiency of the state-of-the-art method while keeping compilation time moderate.
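
To make the idea of hyperplane-based partitioning concrete, here is a minimal Python sketch. It is not the authors' algorithm. It assumes the common cyclic scheme in which an element at index vector x goes to bank (alpha · x) mod N, where alpha is the partitioning vector and an "all-one" hyperplane is read as alpha = (1, ..., 1). The function names `bank_of` and `min_conflict_free_banks`, and the example patterns, are hypothetical and chosen only for illustration.

```python
from itertools import count


def bank_of(point, alpha, n_banks):
    """Bank index of one array element under cyclic hyperplane partitioning.

    Assumed scheme (not taken from the paper): bank = (alpha . point) mod n_banks.
    """
    return sum(a * x for a, x in zip(alpha, point)) % n_banks


def min_conflict_free_banks(pattern, alpha):
    """Smallest bank count for which all accesses in `pattern` hit distinct banks.

    `pattern` lists the index offsets accessed in the same cycle.
    Returns None when two accesses lie on the same hyperplane, because then
    they collide for every bank count under this partitioning vector.
    """
    values = [sum(a * x for a, x in zip(alpha, p)) for p in pattern]
    if len(set(values)) != len(values):
        return None
    # With distinct values, n = max - min + 1 always works, so the loop ends.
    for n in count(max(1, len(values))):
        if len({v % n for v in values}) == len(values):
            return n


# Hypothetical access patterns on a 2-D array.
five_point_stencil = [(0, 1), (1, 0), (1, 1), (1, 2), (2, 1)]
column_only = [(0, 1), (1, 1), (2, 1)]
```

Under these assumptions, the full five-point stencil cannot be served without conflicts by alpha = (1, 1), since (0, 1) and (1, 0) fall on the same hyperplane, while alpha = (1, 2) needs five banks. The three column accesses alone are conflict-free with alpha = (1, 1) and three banks, and their bank index needs only additions. This is meant only to illustrate why the set of accesses that scheduling places in the same cycle affects whether an all-one vector suffices.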
