A Method for Determining Strided Matrix Memory Access Conflicts in Deep Learning Processors

张振兴; 赵永威; 文渊博; 郭崎; 陈云霁

doi:10.7544/issn1000-1239.202550124

A Strided Matrix Memory Access Conflict Determination Mechanism for Deep Learning Processors

Abstract

With the rapid growth of deep learning applications, deep learning processors (DLPs) such as Tensor Core GPUs, TPUs, and MLUs have emerged quickly and deliver excellent performance on matrix operations. To process complex matrix data efficiently, these processors typically load sub-matrices into on-chip memory block by block at a fixed stride, which gives rise to a characteristic strided memory access pattern. However, devices such as the NVIDIA H100, Huawei Ascend, and Cambricon MLU currently rely on complex explicit conflict management and lack hardware control over strided matrix memory access conflicts, which makes the atomicity and consistency of instruction execution hard to guarantee. Traditional memory access conflict detection methods target scalar operations and struggle to capture dependency conflicts among matrix data. To address this, we propose a strided matrix memory access conflict determination method that reduces conflict detection to solving a linear Diophantine equation in two variables, thereby determining the dependencies between memory access instructions exactly. Simulation experiments show that the method markedly improves the memory access performance of typical large language model operators: memory bandwidth utilization reaches up to 94% (LPDDR) and 91% (HBM), and, relative to the Power test and the set intersection method, the average determination overhead falls to 4.68% and 0.31%, respectively. In addition, a hardware evaluation in a 12 nm process shows that the comparator occupies only 0.02333 mm² and consumes 4.1194 mW. Overall, the mechanism replaces complex explicit management with efficient hardware-based determination, effectively improving the memory management capability of DLPs.
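The reduction to a two-variable linear Diophantine equation can be illustrated with a minimal sketch. It is not the paper's hardware comparator: it assumes a simplified one-dimensional model in which each access touches the single addresses `base + k * stride` for `k` in `[0, count)`, with positive strides, and the function names (`extended_gcd`, `strided_conflict`) are hypothetical. Under that assumption, two accesses overlap exactly when `stride1 * i - stride2 * j = base2 - base1` has an integer solution with both indices in range.

```python
def extended_gcd(a, b):
    """Return (g, x, y) such that a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y


def ceil_div(p, q):
    """Ceiling of p / q for integers, q > 0."""
    return -((-p) // q)


def strided_conflict(base1, stride1, count1, base2, stride2, count2):
    """True if the two strided address sequences share an address.

    Assumed model (illustrative only): access k touches base + k*stride.
    """
    if stride1 <= 0 or stride2 <= 0:
        raise ValueError("this sketch assumes positive strides")
    if count1 <= 0 or count2 <= 0:
        return False

    # Solve stride1*i - stride2*j = c over the integers.
    c = base2 - base1
    g, x, y = extended_gcd(stride1, stride2)
    if c % g != 0:
        return False  # no integer solution at all

    # One particular solution, then the general family indexed by t:
    #   i = i0 + t*step_i,  j = j0 + t*step_j
    i0 = x * (c // g)
    j0 = -y * (c // g)
    step_i = stride2 // g
    step_j = stride1 // g

    # Keep only t with 0 <= i < count1 and 0 <= j < count2.
    t_lo = max(ceil_div(-i0, step_i), ceil_div(-j0, step_j))
    t_hi = min((count1 - 1 - i0) // step_i, (count2 - 1 - j0) // step_j)
    return t_lo <= t_hi


if __name__ == "__main__":
    print(strided_conflict(0, 4, 8, 12, 6, 3))  # shares addresses 12 and 24
    print(strided_conflict(0, 4, 8, 2, 4, 8))   # offset 2 is not a multiple of gcd 4
```

The check runs in time logarithmic in the strides, whereas enumerating both address sets and intersecting them grows with the number of accessed elements, which is the kind of overhead gap the abstract reports against the set intersection method. Extending the sketch to rows of several contiguous elements would require additional bounds and is not attempted here.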
