Abstract:
While vector processing units are widely employed in processors for neural networks, signal processing, and high-performance computing, they suffer from expensive shuffle operations dedicated to data alignment. Traditionally, processors handle shuffle operations with a dedicated data shuffle unit. However, such a unit introduces expensive data movement overhead and can only shuffle data serially. In fact, shuffle operations only change the layout of data and ideally should be done entirely within memory. Nowadays, SRAM is no longer just a storage component but can also serve as a computing unit. To this end, we propose Shuffle-SRAM, which can shuffle multiple data elements simultaneously, bit by bit, within an SRAM bank. The key idea is to exploit the bit-line-wise data movement ability of SRAM to shuffle multiple data elements in parallel: all the bits of different data elements on the same bit-line of the SRAM can be shuffled simultaneously, achieving a high level of parallelism. Through suitable data layout preparation and vector shuffle extension instructions, Shuffle-SRAM efficiently supports a wide range of commonly used shuffle operations. Our evaluation results show that Shuffle-SRAM achieves a performance gain of 28x for commonly used shuffle operations and 3.18x for real-world applications including FFT, AlexNet, and VggNet, while the SRAM area increases by only 4.4%.
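As context for the kind of operation targeted here, the following is a minimal C sketch (not the paper's ISA extension; the function name, 8-lane width, and index pattern are hypothetical) illustrating that a vector shuffle merely permutes elements according to an index pattern without computing on the values, which is why it can in principle be performed entirely within memory.

```c
#include <stdint.h>
#include <stdio.h>

#define LANES 8  /* hypothetical vector width, for illustration only */

/* A generic vector shuffle: dst[i] = src[idx[i]].
 * Only the data layout changes; no arithmetic is performed. */
static void vec_shuffle(int32_t dst[LANES], const int32_t src[LANES],
                        const uint8_t idx[LANES]) {
    for (int i = 0; i < LANES; i++)
        dst[i] = src[idx[i]];
}

int main(void) {
    int32_t src[LANES] = {0, 1, 2, 3, 4, 5, 6, 7};
    /* Example pattern: interleave the low and high halves,
     * a common alignment step in kernels such as FFT. */
    uint8_t idx[LANES] = {0, 4, 1, 5, 2, 6, 3, 7};
    int32_t dst[LANES];

    vec_shuffle(dst, src, idx);
    for (int i = 0; i < LANES; i++)
        printf("%d ", dst[i]);   /* prints: 0 4 1 5 2 6 3 7 */
    printf("\n");
    return 0;
}
```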