Citation: Xie Minhui, Lu Youyou, Feng Yangyang, Shu Jiwu. A Recommendation Model Inference System Based on GPU Direct Storage Access Architecture[J]. Journal of Computer Research and Development, 2024, 61(3): 589-599. DOI: 10.7544/issn1000-1239.202330402
Emerging deep learning recommendation models (DLRM) are widely used in modern recommendation systems. The embedding layer unique to DLRM, commonly holding tens of trillions of parameters, induces massive irregular access to storage resources, which becomes the performance bottleneck of model inference. Existing inference systems rely on the CPU to access embedding parameters on DRAM and SSD. However, we find that this architecture suffers from excessive CPU-GPU communication overhead and redundant memory copies, which increase the latency of embedding layers and limit inference performance. In this paper, we propose GDRec, a recommendation model inference system built on a GPU direct storage access architecture. The core idea of GDRec is to eliminate the CPU from the access path of embedding parameters and let the GPU access storage resources directly, in a zero-copy manner. For direct access to DRAM, GDRec retrofits the unified virtual addressing feature of CUDA to allow GPU kernels to issue fine-grained accesses to host DRAM. GDRec further introduces two optimizations, access coalescing and access aligning, to fully unleash the performance of DRAM access. For direct access to SSD, GDRec implements a lightweight NVMe driver on the GPU, allowing the GPU to submit I/O commands that read data from SSD into GPU memory directly, without extra copies through DRAM. GDRec also leverages the massive parallelism of the GPU to shorten the submission time of I/O commands. Experiments on three public datasets show that GDRec improves inference throughput by 1.9 times over NVIDIA HugeCTR, a highly optimized recommendation model inference system.
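The access-coalescing and access-aligning optimizations mentioned above can be illustrated with a minimal sketch (not GDRec's actual implementation; the function name, row size, and the 128-byte alignment granularity are assumptions for illustration): duplicate embedding IDs in a batch are merged so each parameter row is fetched only once, and each fetch is widened to aligned boundaries so the GPU issues only aligned memory transactions.

```python
def coalesce_and_align(ids, row_bytes, line=128):
    """Illustrative sketch: merge duplicate embedding lookups (coalescing)
    and widen each read to `line`-byte boundaries (aligning)."""
    # Coalesce: one fetch per distinct embedding row, in sorted order
    unique_ids = sorted(set(ids))
    reads = []
    for i in unique_ids:
        start = i * row_bytes          # byte offset of the row
        end = start + row_bytes
        aligned_start = start - (start % line)   # round down to boundary
        aligned_end = end + (-end % line)        # round up to boundary
        reads.append((aligned_start, aligned_end))
    return unique_ids, reads

# A batch of 5 lookups with duplicates collapses to 3 aligned reads
ids = [7, 3, 7, 3, 12]
rows, reads = coalesce_and_align(ids, row_bytes=200)
print(rows)      # [3, 7, 12]
print(reads[0])  # (512, 896): bytes 600..800 widened to 128 B boundaries
```

The same idea applies on the SSD path, where reads must additionally be aligned to the device's logical block size rather than a cache line.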