Abstract:
Emerging deep learning recommendation models (DLRMs) are widely deployed in modern recommendation systems. The embedding layer unique to DLRMs, commonly holding tens of trillions of parameters, induces massive irregular accesses to storage resources, which become the performance bottleneck of model inference. Existing inference systems rely on the CPU to access embedding parameters stored in DRAM and on SSDs. However, we find that this architecture suffers from excessive CPU-GPU communication overhead and redundant memory copies, inflating embedding-layer latency and limiting inference performance. In this paper, we propose GDRec, a recommendation model inference system built on a GPU-direct storage-access architecture. The core idea of GDRec is to eliminate the CPU from the access path of embedding parameters and let the GPU access storage resources directly, following a zero-copy paradigm. For direct access to DRAM, GDRec retrofits CUDA's unified virtual addressing (UVA) feature so that GPU kernels can issue fine-grained accesses to host DRAM. GDRec further introduces two optimizations, access coalescing and access aligning, to fully unleash the performance of DRAM access. For direct access to SSDs, GDRec implements a lightweight NVMe driver on the GPU, allowing the GPU to submit I/O commands and read data from the SSD directly into GPU memory, without extra copies through host DRAM. GDRec also exploits the massive parallelism of the GPU to shorten the submission time of I/O commands. Experiments on three public datasets show that GDRec improves inference throughput by 1.9x compared with NVIDIA HugeCTR, a highly optimized recommendation model inference system.
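As a minimal sketch of the zero-copy DRAM path described above (not GDRec's actual code), the following CUDA program pins an embedding table in host DRAM with cudaHostAlloc and lets a kernel gather rows from it directly through unified virtual addressing, with no intermediate cudaMemcpy; EMB_DIM, gather_rows, and the demo sizes are all illustrative assumptions. Mapping one lookup per thread block makes consecutive threads read consecutive floats, which hints at why the access-coalescing optimization matters for PCIe efficiency.

```cuda
// Sketch: zero-copy gather of embedding rows from pinned host DRAM.
#include <cuda_runtime.h>
#include <cstdio>

constexpr int EMB_DIM = 64;            // assumed embedding vector width

// One thread block per lookup; consecutive threads read consecutive floats,
// so host-DRAM loads coalesce into a few large PCIe transactions.
__global__ void gather_rows(const float* host_table, const int* indices,
                            float* out, int num_lookups) {
    int row = blockIdx.x;
    if (row >= num_lookups) return;
    const float* src = host_table + (size_t)indices[row] * EMB_DIM;
    for (int i = threadIdx.x; i < EMB_DIM; i += blockDim.x)
        out[row * EMB_DIM + i] = src[i];          // direct load over PCIe
}

int main() {
    const size_t num_rows = 1 << 20;   // 1M-row table for the demo
    const int num_lookups = 256;

    cudaSetDeviceFlags(cudaDeviceMapHost);
    float* table;                      // embedding table in pinned host DRAM
    cudaHostAlloc((void**)&table, num_rows * EMB_DIM * sizeof(float),
                  cudaHostAllocMapped);

    int* d_indices; float* d_out;
    cudaMalloc((void**)&d_indices, num_lookups * sizeof(int));
    cudaMalloc((void**)&d_out, num_lookups * EMB_DIM * sizeof(float));
    cudaMemset(d_indices, 0, num_lookups * sizeof(int));  // demo: all hit row 0

    // Under UVA, the pinned host pointer is directly dereferenceable on device.
    gather_rows<<<num_lookups, 32>>>(table, d_indices, d_out, num_lookups);
    cudaDeviceSynchronize();
    printf("status: %s\n", cudaGetErrorString(cudaGetLastError()));

    cudaFreeHost(table); cudaFree(d_indices); cudaFree(d_out);
    return 0;
}
```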
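The SSD path is harder to condense, but the heavily simplified kernel below illustrates the idea of parallel I/O submission: each GPU thread fills one NVMe read command whose DMA target is GPU memory, and a single thread then rings the doorbell for the whole batch. Everything here is an assumption for illustration only: NvmeSqe is an abbreviated 64-byte submission queue entry, sq and sq_db stand for a queue and doorbell register already mapped into the GPU's address space, and a real driver would also poll completion queues and synchronize across blocks.

```cuda
// Sketch: batched NVMe read submission from a single GPU thread block.
#include <cstdint>

struct NvmeSqe {                // abbreviated 64-byte submission queue entry
    uint8_t  opcode;            // 0x02 = NVMe read
    uint8_t  flags;
    uint16_t cid;               // command identifier
    uint32_t nsid;              // namespace id
    uint64_t rsvd0, mptr;
    uint64_t prp1, prp2;        // DMA target addresses (here: GPU memory)
    uint64_t slba;              // starting logical block address
    uint32_t dw12;              // low 16 bits: number of blocks - 1
    uint32_t dw13, dw14, dw15;
};

constexpr uint32_t QUEUE_DEPTH = 1024;   // assumed submission queue depth

// Launch with one block of >= n threads: each thread writes one command,
// then thread 0 advances the tail doorbell once for the whole batch, so
// submission time no longer grows linearly with batch size.
__global__ void submit_reads(NvmeSqe* sq, volatile uint32_t* sq_db,
                             const uint64_t* lbas, uint64_t prp_base,
                             uint32_t lba_bytes, uint32_t tail, int n) {
    int i = threadIdx.x;
    if (i < n) {
        NvmeSqe cmd = {};
        cmd.opcode = 0x02;                               // NVMe read
        cmd.cid    = (uint16_t)((tail + i) & 0xFFFF);
        cmd.nsid   = 1;
        cmd.prp1   = prp_base + (uint64_t)i * lba_bytes; // land data in GPU memory
        cmd.slba   = lbas[i];
        cmd.dw12   = 0;                                  // one block per command
        sq[(tail + i) % QUEUE_DEPTH] = cmd;
    }
    __threadfence_system();   // make queue entries visible beyond the GPU
    __syncthreads();          // all entries written before the doorbell rings
    if (i == 0)
        *sq_db = (tail + n) % QUEUE_DEPTH;               // ring doorbell once
}
```

Ringing the doorbell once per batch, rather than once per command as a CPU driver loop would, is what lets the GPU's parallelism shorten submission time, the optimization the abstract highlights.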