Abstract:
With the rapid development of deep learning applications, deep learning processors (DLPs) such as TensorCore GPUs, TPUs, and MLUs have risen rapidly and demonstrated excellent performance in matrix operations. To handle complex matrix data efficiently, these processors typically partition matrices into sub-matrices and load them into on-chip memory with a fixed stride, establishing a distinct pattern of strided memory access. However, devices such as the NVIDIA H100, Huawei Ascend, and Cambricon MLU currently depend on complex explicit conflict management and lack dedicated hardware controls for strided memory access conflicts, making it challenging to ensure the atomicity and consistency of instruction execution. Traditional memory conflict detection methods focus on scalar operations and have difficulty capturing dependency conflicts among matrix data. To address this issue, we propose a strided matrix memory access conflict determination method that transforms conflict detection into the problem of solving a binary linear Diophantine equation, thereby accurately determining the dependency relationships between memory access instructions. Simulation experiments demonstrate that this approach markedly enhances the memory performance of typical operators in large language models: memory bandwidth utilization peaks at 94% (LPDDR) and 91% (HBM), while the average determination overhead is reduced to 4.68% and 0.31% compared with the Power test and set intersection methods, respectively. In addition, hardware evaluation based on a 12 nm process shows that the determination design occupies only 0.02333 mm² and consumes 4.1194 mW. Overall, this mechanism transitions from complex explicit management to efficient hardware-based determination, effectively improving the memory management of DLPs.
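The core idea of reducing strided-access conflict detection to a binary linear Diophantine equation can be illustrated with a minimal software sketch. Two strided sequences base_a + i*stride_a and base_b + j*stride_b touch a common address exactly when stride_a*i - stride_b*j = base_b - base_a has an integer solution with both indices in range; a gcd divisibility test gives a fast necessary condition. The function name, parameters, and the brute-force bound check below are illustrative assumptions, not the paper's hardware design:

```python
from math import gcd

def strided_ranges_conflict(base_a, stride_a, count_a,
                            base_b, stride_b, count_b):
    """Illustrative reference model (assumes positive strides).

    Sequence A touches addresses base_a + i*stride_a, i in [0, count_a).
    Sequence B touches addresses base_b + j*stride_b, j in [0, count_b).
    A conflict exists iff the linear Diophantine equation
        stride_a*i - stride_b*j = base_b - base_a
    has a solution with both i and j inside their index bounds.
    """
    c = base_b - base_a
    # Necessary condition: gcd(stride_a, stride_b) must divide the offset,
    # otherwise the Diophantine equation has no integer solution at all.
    if c % gcd(stride_a, stride_b) != 0:
        return False
    # For this small reference model, enumerate one index and solve
    # exactly for the other (hardware would use a closed-form check).
    for i in range(count_a):
        num = base_a + i * stride_a - base_b
        if num % stride_b == 0 and 0 <= num // stride_b < count_b:
            return True
    return False
```

For example, sequences starting at 0 and 2 with stride 4 can never collide (the offset 2 is not divisible by gcd(4, 4) = 4), so the gcd test rejects them without enumerating any addresses.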