Citation: Wang Qing, Li Junru, Shu Jiwu. Survey on In-Network Storage Systems[J]. Journal of Computer Research and Development, 2023, 60(11): 2681-2695. DOI: 10.7544/issn1000-1239.202220865
Programmable network devices, such as programmable switches and SmartNICs, are increasingly deployed in modern data centers. They can execute customized data-processing logic directly on the network transmission path, which opens new opportunities for building high-performance in-network storage systems. However, these devices have limited hardware resources (e.g., limited expressive power and small memory), so fully exploiting their advantages to accelerate storage systems remains challenging. We systematically review recent research progress on in-network storage systems. First, we describe the hardware architecture and performance characteristics of programmable network devices and, on this basis, summarize two major challenges in building high-performance in-network storage systems: 1) the division of labor between hardware and software, and 2) the fault tolerance of the storage system. Then, we classify and describe existing in-network storage systems according to the task performed by the programmable network device: data caching, distributed coordination, request scheduling, or data aggregation. Using several representative systems as examples, we analyze the corresponding design difficulties and software techniques. Finally, we identify open problems for future research on in-network storage systems, including switch-NIC collaboration, data security, multi-tenancy, and automatic function offloading.