Citation: Zhang Dunbo, Zeng Lingling, Wang Ruoxi, Wang Yaohua, Shen Li. Shuffle-SRAM: In-SRAM Parallel Bitwise Data Shuffle[J]. Journal of Computer Research and Development, 2025, 62(1): 75-89. DOI: 10.7544/issn1000-1239.202330151
While vector processing units are widely employed in processors for neural networks, signal processing, and high-performance computing, they suffer from the expensive shuffle operations required for data alignment. Traditionally, processors handle shuffle operations with a dedicated data shuffle unit. However, such a unit introduces costly data movement and can only shuffle data serially. In fact, shuffle operations merely change the layout of data and ideally should be performed entirely within memory. Today, SRAM is no longer just a storage component but can also serve as a computing unit. To this end, we propose Shuffle-SRAM, which shuffles multiple data elements simultaneously, bit by bit, within an SRAM bank. The key idea is to exploit the bit-line-wise data movement ability of SRAM to shuffle multiple data elements in parallel: all the bits of different data elements on the same bit-line can be shuffled simultaneously, achieving a high level of parallelism. Through suitable data layout preparation and vector shuffle extension instructions, Shuffle-SRAM efficiently supports a wide range of commonly used shuffle operations. Our evaluation shows that Shuffle-SRAM achieves a performance gain of 28 times for commonly used shuffle operations and 3.18 times for real-world applications including FFT, AlexNet, and VGGNet, while increasing SRAM area by only 4.4%.
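To make the parallelism argument concrete, the following is a minimal Python sketch (not the paper's implementation) of the transposed, bit-serial layout the abstract describes: each SRAM column (bit-line) holds one data element, and each row (word-line) holds one bit position of every element. A shuffle then proceeds bit by bit, with each row operation moving the corresponding bit of all elements at once, so an N-element, B-bit shuffle takes B row steps instead of N serial element moves. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def to_bank(values, bits):
    """Pack integers into the transposed bit matrix: element j occupies
    column j, and row b holds bit b of every element (LSB in row 0)."""
    return np.array([[(v >> b) & 1 for v in values] for b in range(bits)],
                    dtype=np.uint8)

def from_bank(bank):
    """Unpack the transposed bit matrix back into a list of integers."""
    return [int(sum(int(bank[b, j]) << b for b in range(bank.shape[0])))
            for j in range(bank.shape[1])]

def in_sram_shuffle(bank, perm):
    """Simulate an in-SRAM parallel bitwise shuffle: output position j
    receives element perm[j]. Each loop iteration models one row-wide
    operation that shuffles bit b of ALL elements simultaneously, so the
    step count depends on the word width, not the element count."""
    out = np.empty_like(bank)
    for b in range(bank.shape[0]):   # one step per bit position
        out[b, :] = bank[b, perm]    # all elements' bit b moved in parallel
    return out

bank = to_bank([3, 5, 7, 9], bits=4)
shuffled = from_bank(in_sram_shuffle(bank, np.array([2, 0, 3, 1])))
print(shuffled)  # elements reordered by the permutation: [7, 3, 9, 5]
```

The loop bound (word width B) versus the element count N is the source of the speedup the paper reports: for wide vectors, B is fixed while N grows with the vector length.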
[1] Chen Weihao, Li Kaixiang, Lin Weiyu, et al. A 65nm 1MB nonvolatile computing-in-memory ReRAM macro with sub-16ns multiply-and-accumulate for binary DNN AI edge processors[C]//Proc of IEEE Int Solid-State Circuits Conf. Piscataway, NJ: IEEE, 2018: 494−496
[2] Domingos J, Neves N, Roma N, et al. Unlimited vector extension with data streaming support[C]//Proc of the 48th ACM/IEEE Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2021: 209−222
[3] Malkowsky S, Prabhu H, Liu Liang, et al. A programmable 16-lane SIMD ASIP for massive MIMO[C/OL]//Proc of IEEE Int Symp on Circuits and Systems. Piscataway, NJ: IEEE, 2019[2023-01-13]. https://ieeexplore-ieee-org-s.libyc.nudt.edu.cn/stamp/stamp.jsp?tp=&arnumber=8702770
[4] Stephens N, Biles S, Boettcher M, et al. The ARM scalable vector extension[J]. IEEE Micro, 2017, 37(2): 26−39 doi: 10.1109/MM.2017.35
[5] Tagliavini G, Mach S, Rossi D, et al. Design and evaluation of SmallFloat SIMD extensions to the RISC-V ISA[C]//Proc of Design, Automation & Test in Europe Conf & Exhibition. Piscataway, NJ: IEEE, 2019: 654−657
[6] Fowers J, Ovtcharov K, Papamichael M T, et al. A configurable cloud-scale DNN processor for real-time AI[C/OL]//Proc of the 45th ACM/IEEE Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018[2023-01-13]. https://web.tecnico.ulisboa.pt/~joaomiguelvieira/public/PTDC/EEI-HAC/4583/2021/JF18.pdf
[7] Khailany B, Dally W, Kapasi U, et al. Imagine: Media processing with streams[J]. IEEE Micro, 2001, 21(2): 35−46 doi: 10.1109/40.918001
[8] ANDES Technology. AndesCore™ NX27V processor[EB/OL]. 2020[2023-01-13]. https://www.andestech.com/en/products-solutions/andescore-processors/riscv-nx27v/
[9] Liao Heng, Tu Jiajin, Xia Jing, et al. DaVinci: A scalable architecture for neural network computing[C/OL]//Proc of the 31st IEEE Hot Chips Symp. Piscataway, NJ: IEEE, 2019[2023-01-13]. https://pdfs.semanticscholar.org/4ca3/69ee20343433bd50c288c01ebcba2d6a03b2.pdf
[10] Krashinsky R, Batten C, Hampton M, et al. The vector-thread architecture[J]. IEEE Micro, 2004, 24(6): 84−90 doi: 10.1109/MM.2004.90
[11] Lee Y, Avizienis R, Bishara A, et al. Exploring the tradeoffs between programmability and efficiency in data-parallel accelerators[C]//Proc of the 38th Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2011: 129−140
[12] Lin Yuan, Lee H, Woh M, et al. SODA: A low-power architecture for software radio[C]//Proc of the 33rd Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2006: 89−101
[13] Chen Shuming, Wang Yaohua, Liu Sheng, et al. FT-Matrix: A coordination-aware architecture for signal processing[J]. IEEE Micro, 2014, 34(6): 64−73 doi: 10.1109/MM.2013.129
[14] Woh M, Seo S, Mahlke S, et al. AnySP: Anytime anywhere anyway signal processing[J]. IEEE Micro, 2010, 30(1): 81−91 doi: 10.1109/MM.2010.8
[15] Chen Y, Emer J, Sze V. Eyeriss: A spatial architecture for energy-efficient dataflow for convolutional neural networks[C]//Proc of the 43rd ACM/IEEE Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2016: 367−379
[16] Cong J, Xiao Bingjun. Minimizing computation in convolutional neural networks[C]//Proc of the 24th Int Conf on Artificial Neural Networks. Berlin: Springer, 2014: 281−290
[17] Chellapilla K, Puri S, Simard P. High performance convolutional neural networks for document processing[C/OL]//Proc of the 10th Int Workshop on Frontiers in Handwriting Recognition. La Baule: Université de Rennes 1, 2006[2023-01-13]. https://inria.hal.science/file/index/docid/112631/filename/p1038112283956.pdf
[18] Hsu S, Agarwal A, Anders M, et al. A 280 mV-to-1.1 V 256b reconfigurable SIMD vector permutation engine with 2-dimensional shuffle in 22 nm tri-gate CMOS[J]. IEEE Journal of Solid-State Circuits, 2013, 48(1): 118−127 doi: 10.1109/JSSC.2012.2222811
[19] Raghavan P, Munaga S, Ramos E, et al. A customized cross-bar for data-shuffling in domain-specific SIMD processors[C]//Proc of the 20th Int Conf on Architecture of Computer System. New York: ACM, 2007: 57−68
[20] National Center for Biotechnology Information. PubChem patent summary for US−7631025-B2, method and apparatus for rearranging data between multiple registers[EB/OL].[2023-01-13]. https://pubchem.ncbi.nlm.nih.gov/patent/US-7631025-B2
[21] Veluri H, Li Yida, Niu Xuhua, et al. High-throughput, area-efficient, and variation-tolerant 3-D in-memory compute system for deep convolutional neural networks[J]. IEEE Internet of Things Journal, 2021, 8(11): 9219−9232 doi: 10.1109/JIOT.2021.3058015
[22] Wang Yaohua, Li Chen, Liu Chang, et al. Advancing DSP into HPC, AI, and beyond: Challenges, mechanisms, and future directions[J]. CCF Transactions on High Performance Computing, 2021, 3(1): 114−125 doi: 10.1007/s42514-020-00057-2
[23] Wang Yaohua, Wang Dong, Chen Shuming, et al. Iteration interleaving-based SIMD lane partition[J]. ACM Transactions on Architecture and Code Optimization, 2016, 12(4): 1−18
[24] Yang Xuejun, Yan Xiaobo, Xing Zuocheng, et al. A 64-bit stream processor architecture for scientific applications[J]. ACM SIGARCH Computer Architecture News, 2007, 35(2): 210−219 doi: 10.1145/1273440.1250689
[25] Hennessy J L, Patterson D A. Computer Architecture: A Quantitative Approach[M]. Amsterdam: Elsevier, 2017
[26] Eckert C, Wang Xiaowei, Wang Jingcheng, et al. Neural cache: Bit-serial in-cache acceleration of deep neural networks[C]//Proc of the 45th ACM/IEEE Annual Int Symp on Computer Architecture. Piscataway, NJ: IEEE, 2018: 383−396
[27] Wang Jingcheng, Wang Xiaowei, Eckert C, et al. A 28-nm compute SRAM with bit-serial logic/arithmetic operations for programmable in-memory vector computing[J]. IEEE Journal of Solid-State Circuits, 2020, 55(1): 76−86 doi: 10.1109/JSSC.2019.2939682
[28] Aga S, Jeloka S, Subramaniyan A, et al. Compute caches[C]//Proc of IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2017: 481−492
[29] Huang Libo, Shen Li, Wang Zhiying, et al. SIF: Overcoming the limitations of SIMD devices via implicit permutation[C]//Proc of the 16th IEEE Int Symp on High Performance Computer Architecture. Piscataway, NJ: IEEE, 2010: 303−314
[30] Cochran W, Cooley J, Favin D, et al. What is the fast Fourier transform?[J]. IEEE Transactions on Audio and Electroacoustics, 1967, 15(2): 45−55 doi: 10.1109/TAU.1967.1161899
[31] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint, arXiv: 1409.1556, 2015
[32] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[J]. Communications of the ACM, 2017, 60(6): 84−90 doi: 10.1145/3065386