    Xie Minhui, Lu Youyou, Feng Yangyang, Shu Jiwu. A Recommendation Model Inference System Based on GPU Direct Storage Access Architecture[J]. Journal of Computer Research and Development, 2024, 61(3): 589-599. DOI: 10.7544/issn1000-1239.202330402

    A Recommendation Model Inference System Based on GPU Direct Storage Access Architecture

    • Abstract: Emerging deep learning recommendation models (DLRM) are widely used in modern recommendation systems. Their distinctive component, an embedding layer commonly containing tens of trillions of parameters, induces massive irregular sparse accesses to memory and storage, which has become the performance bottleneck of model inference. Existing inference systems rely on the CPU to access embedding parameters stored in DRAM and on SSDs. We find that this architecture suffers from excessive CPU-GPU communication overhead and redundant memory copies, which increase the latency of the embedding layer and limit inference performance. In this paper, we propose GDRec, a recommendation model inference system based on a GPU direct storage access architecture. The core idea of GDRec is to remove the CPU from the access path of embedding parameters and let the GPU directly access memory and storage resources in a zero-copy manner. For direct access to host memory, GDRec leverages the unified virtual addressing feature of CUDA (compute unified device architecture) to let GPU kernels issue fine-grained accesses to host DRAM, and introduces two optimizations, access coalescing and access aligning, to fully exploit DRAM access performance. For direct access to external storage, GDRec implements a lightweight NVMe driver on the GPU, allowing the GPU to submit I/O commands and read data from solid state drives (SSDs) directly into GPU memory without extra copies in host DRAM; GDRec further exploits the massive parallelism of the GPU to shorten the submission time of I/O commands. Experiments on three public click-through rate prediction datasets show that GDRec improves inference throughput by up to 1.9 times over NVIDIA HugeCTR, a highly optimized inference system based on the CPU-centric access architecture.
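
    The following is a minimal CUDA sketch of the zero-copy, unified-virtual-addressing mechanism that the abstract describes for direct DRAM access: a pinned, GPU-mapped host buffer holds the embedding table, and a GPU kernel gathers rows from it directly over PCIe, with consecutive threads reading consecutive elements so the accesses coalesce. This illustrates the underlying CUDA technique only, not GDRec's actual implementation; names such as gather_embeddings, DIM, and the batch size are assumptions made for the example.

    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    #define DIM 64  // embedding dimension; an illustrative assumption

    // One thread block gathers one embedding row. Consecutive threads read
    // consecutive floats, so the reads that cross PCIe to host DRAM coalesce.
    __global__ void gather_embeddings(const float *host_table, // pinned host memory mapped into the GPU address space
                                      const int   *indices,    // sparse feature IDs of one batch (device memory)
                                      float       *out,        // gathered rows (device memory)
                                      int          num_indices) {
        int row = blockIdx.x;
        if (row >= num_indices) return;
        const float *src = host_table + (size_t)indices[row] * DIM;
        for (int i = threadIdx.x; i < DIM; i += blockDim.x)
            out[(size_t)row * DIM + i] = src[i];  // direct load from host DRAM, no staging cudaMemcpy
    }

    int main() {
        const size_t rows = 1 << 20;   // a toy 1M-row embedding table (~256 MB)
        float *table_h = nullptr;
        // Pinned + mapped allocation: under unified virtual addressing the same
        // pointer is valid on both CPU and GPU, so kernels can read it zero-copy.
        cudaHostAlloc((void **)&table_h, rows * DIM * sizeof(float), cudaHostAllocMapped);
        memset(table_h, 0, rows * DIM * sizeof(float));

        const int num = 4096;          // embedding lookups in one inference batch
        int *idx_d = nullptr;
        float *out_d = nullptr;
        cudaMalloc((void **)&idx_d, num * sizeof(int));
        cudaMalloc((void **)&out_d, (size_t)num * DIM * sizeof(float));
        cudaMemset(idx_d, 0, num * sizeof(int));  // all-zero IDs for the demo; a real batch carries real sparse IDs

        gather_embeddings<<<num, 32>>>(table_h, idx_d, out_d, num);
        cudaDeviceSynchronize();
        printf("gather: %s\n", cudaGetErrorString(cudaGetLastError()));

        cudaFree(out_d);
        cudaFree(idx_d);
        cudaFreeHost(table_h);
        return 0;
    }

    GDRec's access-coalescing and access-aligning optimizations further batch and align such fine-grained reads, and the SSD path additionally relies on the GPU-side NVMe driver described in the abstract; neither is covered by this sketch.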

       
