Citation: Zhang Dunbo, Zeng Lingling, Wang Ruoxi, Wang Yaohua, Shen Li. Shuffle-SRAM: In-SRAM Parallel Bitwise Data Shuffle[J]. Journal of Computer Research and Development, 2025, 62(1): 75-89. DOI: 10.7544/issn1000-1239.202330151
While vector processing units are widely employed in processors for neural networks, signal processing, and high-performance computing, they suffer from the expensive shuffle operations required for data alignment. Traditionally, processors handle shuffle operations with a dedicated data shuffle unit. However, such a unit introduces costly data movement and can only shuffle data serially. In fact, shuffle operations merely change the layout of data and ideally should be performed entirely within memory. Today, SRAM is no longer just a storage component but can also serve as a computing unit. To this end, we propose Shuffle-SRAM, which shuffles multiple data elements simultaneously, bit by bit, within an SRAM bank. The key idea is to exploit the bit-line-wise data movement ability of SRAM to shuffle multiple data elements in parallel: all the bits of different data elements on the same bit-line can be shuffled simultaneously, achieving a high level of parallelism. Through suitable data layout preparation and vector shuffle extension instructions, Shuffle-SRAM efficiently supports a wide range of commonly used shuffle operations. Our evaluation shows that Shuffle-SRAM achieves a performance gain of 28 times for commonly used shuffle operations and 3.18 times for real-world applications including FFT, AlexNet, and VGGNet, while increasing SRAM area by only 4.4%.
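To make the parallelism argument concrete, the following is a minimal Python sketch (not the paper's implementation) of the transposed, bit-serial layout the abstract describes: each SRAM column (bit-line) holds one data element, and each row (word-line) holds one bit position of every element. A shuffle then proceeds bit by bit, with each row operation moving the corresponding bit of all elements at once, so an N-element, B-bit shuffle takes B row steps instead of N serial element moves. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def to_bank(values, bits):
    """Pack integers into the transposed bit matrix: element j occupies
    column j, and row b holds bit b of every element (LSB in row 0)."""
    return np.array([[(v >> b) & 1 for v in values] for b in range(bits)],
                    dtype=np.uint8)

def from_bank(bank):
    """Unpack the transposed bit matrix back into a list of integers."""
    return [int(sum(int(bank[b, j]) << b for b in range(bank.shape[0])))
            for j in range(bank.shape[1])]

def in_sram_shuffle(bank, perm):
    """Simulate an in-SRAM parallel bitwise shuffle: output position j
    receives element perm[j]. Each loop iteration models one row-wide
    operation that shuffles bit b of ALL elements simultaneously, so the
    step count depends on the word width, not the element count."""
    out = np.empty_like(bank)
    for b in range(bank.shape[0]):   # one step per bit position
        out[b, :] = bank[b, perm]    # all elements' bit b moved in parallel
    return out

bank = to_bank([3, 5, 7, 9], bits=4)
shuffled = from_bank(in_sram_shuffle(bank, np.array([2, 0, 3, 1])))
print(shuffled)  # elements reordered by the permutation: [7, 3, 9, 5]
```

The loop bound (word width B) versus the element count N is the source of the speedup the paper reports: for wide vectors, B is fixed while N grows with the vector length.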
[1] Chen Weihao, Li Kaixiang, Lin Weiyu, et al. A 65nm 1MB nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors[C]//Proc of IEEE Int Solid-State Circuits Conf. Piscataway, NJ: IEEE, 2018: 494−496
[2] Domingos J, Neves N, Roma N, et al. Unlimited vector extension with data streaming support[C]//Proc of the 48th ACM/IEEE Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2021: 209−222
[3] Malkowsky S, Prabhu H, Liu Liang, et al. A programmable 16-lane SIMD ASIP for massive MIMO[C/OL]//Proc of IEEE Int Symp on Circuits and Systems. Piscataway, NJ: IEEE, 2019[2023-01-13]. https://ieeexplore-ieee-org-s.libyc.nudt.edu.cn/stamp/stamp.jsp?tp=&arnumber=8702770
[4] Stephens N, Biles S, Boettcher M, et al. The ARM scalable vector extension[J]. IEEE Micro, 2017, 37(2): 26−39 doi: 10.1109/MM.2017.35
[5] Tagliavini G, Mach S, Rossi D, et al. Design and evaluation of SmallFloat SIMD extensions to the RISC-V ISA[C]//Proc of Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2019: 654−657
[6] Fowers J, Ovtcharov K, Papamichael M T, et al. A configurable cloud-scale DNN processor for real-time AI[C/OL]//Proc of the 45th ACM/IEEE Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018[2023-01-13]. https://web.tecnico.ulisboa.pt/~joaomiguelvieira/public/PTDC/EEI-HAC/4583/2021/JF18.pdf
[7] Khailany B, Dally W, Kapasi U, et al. Imagine: Media processing with streams[J]. IEEE Micro, 2001, 21(2): 35−46 doi: 10.1109/40.918001
[8] ANDES Technology. AndesCore™ NX27V processor[EB/OL]. 2020[2023-01-13]. https://www.andestech.com/en/products-solutions/andescore-processors/riscv-nx27v/
[9] Liao Heng, Tu Jiajin, Xia Jing, et al. DaVinci: A scalable architecture for neural network computing[C/OL]//Proc of the 31st IEEE Hot Chips Symp. Piscataway, NJ: IEEE, 2019[2023-01-13]. https://pdfs.semanticscholar.org/4ca3/69ee20343433bd50c288c01ebcba2d6a03b2.pdf
[10] Krashinsky R, Batten C, Hampton M, et al. The vector-thread architecture[J]. IEEE Micro, 2004, 24(6): 84−90 doi: 10.1109/MM.2004.90
[11] Lee Y, Avizienis R, Bishara A, et al. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators[C]//Proc of the 38th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2011: 129−140
[12] Lin Yuan, Lee H, Woh M, et al. SODA: A low-power architecture for software radio[C]//Proc of the 33rd Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2006: 89−101
[13] Chen Shuming, Wang Yaohua, Liu Sheng, et al. FT-Matrix: A coordination-aware architecture for signal processing[J]. IEEE Micro, 2014, 34(6): 64−73 doi: 10.1109/MM.2013.129
[14] Woh M, Seo S, Mahlke S, et al. AnySP: Anytime anywhere anyway signal processing[J]. IEEE Micro, 2010, 30(1): 81−91 doi: 10.1109/MM.2010.8
[15] Chen Y, Emer J, Sze V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks[C]//Proc of the 43rd ACM/IEEE Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2016: 367−379
[16] Cong J, Xiao Bingjun. Minimizing computation in convolutional neural networks[C]//Proc of the 24th Int Conf on Artificial Neural Networks. Berlin: Springer, 2014: 281−290
[17] Chellapilla K, Puri S, Simard P. High performance convolutional neural networks for document processing[C/OL]//Proc of the 10th Int Workshop on Frontiers in Handwriting Recognition. La Baule: Université de Rennes 1, 2006[2023-01-13]. https://inria.hal.science/file/index/docid/112631/filename/p1038112283956.pdf
[18] Hsu S, Agarwal A, Anders M, et al. A 280 mV-to-1.1 V 256b reconfigurable SIMD vector permutation engine with 2-dimensional shuffle in 22 nm tri-gate CMOS[J]. IEEE Journal of Solid-State Circuits, 2013, 48(1): 118−127 doi: 10.1109/JSSC.2012.2222811
[19] Raghavan P, Munaga S, Ramos E, et al. A customized cross-bar for data-shuffling in domain-specific SIMD processors[C]//Proc of the 20th Int Conf on Architecture of Computer System. New York: ACM, 2007: 57−68
[20] National Center for Biotechnology Information. PubChem patent summary for US−7631025-B2, method and apparatus for rearranging data between multiple registers[EB/OL].[2023-01-13]. https://pubchem.ncbi.nlm.nih.gov/patent/US-7631025-B2
[21] Veluri H, Li Yida, Niu Xuhua, et al. High-throughput, area-efficient, and variation-tolerant 3-D in-memory compute system for deep convolutional neural networks[J]. IEEE Internet of Things Journal, 2021, 8(11): 9219−9232 doi: 10.1109/JIOT.2021.3058015
[22] Wang Yaohua, Li Chen, Liu Chang, et al. Advancing DSP into HPC, AI, and beyond: Challenges, mechanisms, and future directions[J]. CCF Transactions on High Performance Computing, 2021, 3(1): 114−125 doi: 10.1007/s42514-020-00057-2
[23] Wang Yaohua, Wang Dong, Chen Shuming, et al. Iteration interleaving-based SIMD lane partition[J]. ACM Transactions on Architecture and Code Optimization, 2016, 12(4): 1−18
[24] Yang Xuejun, Yan Xiaobo, Xing Zuocheng, et al. A 64-bit stream processor architecture for scientific applications[J]. ACM SIGARCH Computer Architecture News, 2007, 35(2): 210−219 doi: 10.1145/1273440.1250689
[25] Hennessy J L, Patterson D A. Computer Architecture: A Quantitative Approach[M]. Amsterdam: Elsevier, 2017
[26] Eckert C, Wang Xiaowei, Wang Jingcheng, et al. Neural cache: Bit-serial in-cache acceleration of deep neural networks[C]//Proc of the 45th ACM/IEEE Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 383−396
[27] Wang Jingcheng, Wang Xiaowei, Eckert C, et al. A 28-nm compute SRAM with bit-serial logic/arithmetic operations for programmable in-memory vector computing[J]. IEEE Journal of Solid-State Circuits, 2020, 55(1): 76−86 doi: 10.1109/JSSC.2019.2939682
[28] Aga S, Jeloka S, Subramaniyan A, et al. Compute caches[C]//Proc of IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2017: 481−492
[29] Huang Libo, Shen Li, Wang Zhiying, et al. SIF: Overcoming the limitations of SIMD devices via implicit permutation[C]//Proc of the 16th IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2010: 303−314
[30] Cochran W, Cooley J, Favin D, et al. What is the fast Fourier transform?[J]. IEEE Transactions on Audio and Electroacoustics, 1967, 15(2): 45−55 doi: 10.1109/TAU.1967.1161899
[31] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint, arXiv: 1409.1556, 2015
[32] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84−90 doi: 10.1145/3065386