Citation: Xie Minhui, Lu Youyou, Feng Yangyang, Shu Jiwu. A Recommendation Model Inference System Based on GPU Direct Storage Access Architecture[J]. Journal of Computer Research and Development, 2024, 61(3): 589-599. DOI: 10.7544/issn1000-1239.202330402
Emerging deep learning recommendation models (DLRM) are widely used in modern recommendation systems. The embedding layer unique to DLRM, commonly holding tens of trillions of parameters, induces massive irregular access to storage resources, which becomes the performance bottleneck of model inference. Existing inference systems rely on the CPU to access embedding parameters on DRAM and SSD. However, we find that this architecture suffers from excessive CPU-GPU communication overhead and redundant memory copies, which increase the latency of embedding layers and limit inference performance. In this paper, we propose GDRec, a recommendation model inference system built on a GPU direct storage access architecture. The core idea of GDRec is to eliminate the CPU from the access path of embedding parameters and let the GPU access storage resources directly, in a zero-copy manner. For direct access to DRAM, GDRec retrofits the unified virtual addressing feature of CUDA to allow GPU kernels to issue fine-grained accesses to host DRAM. GDRec further introduces two optimizations, access coalescing and access aligning, to fully unleash the performance of DRAM access. For direct access to SSD, GDRec implements a lightweight NVMe driver on the GPU, allowing the GPU to submit I/O commands that read data from SSD into GPU memory directly, without extra copies through DRAM. GDRec also leverages the massive parallelism of the GPU to shorten the submission time of I/O commands. Experiments on three public datasets show that GDRec improves inference throughput by 1.9 times over NVIDIA HugeCTR, a highly optimized recommendation model inference system.
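The access-coalescing and access-aligning optimizations mentioned above can be illustrated with a minimal sketch (not GDRec's actual implementation; the function name, row size, and the 128-byte alignment granularity are assumptions for illustration): duplicate embedding IDs in a batch are merged so each parameter row is fetched only once, and each fetch is widened to aligned boundaries so the GPU issues only aligned memory transactions.

```python
def coalesce_and_align(ids, row_bytes, line=128):
    """Illustrative sketch: merge duplicate embedding lookups (coalescing)
    and widen each read to `line`-byte boundaries (aligning)."""
    # Coalesce: one fetch per distinct embedding row, in sorted order
    unique_ids = sorted(set(ids))
    reads = []
    for i in unique_ids:
        start = i * row_bytes          # byte offset of the row
        end = start + row_bytes
        aligned_start = start - (start % line)   # round down to boundary
        aligned_end = end + (-end % line)        # round up to boundary
        reads.append((aligned_start, aligned_end))
    return unique_ids, reads

# A batch of 5 lookups with duplicates collapses to 3 aligned reads
ids = [7, 3, 7, 3, 12]
rows, reads = coalesce_and_align(ids, row_bytes=200)
print(rows)      # [3, 7, 12]
print(reads[0])  # (512, 896): bytes 600..800 widened to 128 B boundaries
```

The same idea applies on the SSD path, where reads must additionally be aligned to the device's logical block size rather than a cache line.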