ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2021, Vol. 58 ›› Issue (1): 164-177.doi: 10.7544/issn1000-1239.2021.20190723

• 网络技术 • 上一篇    下一篇

基于在网计算加速的拜占庭容错算法

杨帆1,2, 张鹏1,2, 王展1, 元国军1, 安学军1   

  1. 1(中国科学院计算技术研究所 北京 100190);2(中国科学院大学 北京 100049) (yangfan@ncic.ac.cn)
  • 出版日期: 2021-01-01
  • 基金资助: 
    国家重点研发计划项目(2018YFB0204400,2016YFB0200205);国家自然科学基金青年基金项目(61702484);中国科学院战略性先导科技专项(B类)项目(XDB24050100)

Accelerating Byzantine Fault Tolerance with In-Network Computing

Yang Fan1,2, Zhang Peng1,2, Wang Zhan1, Yuan Guojun1, An Xuejun1   

  1. 1(Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190);2(University of Chinese Academy of Sciences, Beijing 100049)
  • Online: 2021-01-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFB0204400, 2016YFB0200205), the National Natural Science Foundation of China for Young Scientists (61702484), and the Strategic Priority Research Program of the Chinese Academy of Sciences (class B) (XDB24050100).

摘要: 拜占庭容错算法是一类能够容忍各种形式的软件错误和安全漏洞的容错算法,对云计算的可靠性保障有着重要意义.与其他容错算法相比,拜占庭容错算法稳定性更高,但是其性能表现低下,不能满足当前系统对高吞吐、低延时的需求.在网计算是一种以数据为中心的体系结构,它用网络承担部分计算功能,使数据在流动过程中获得处理,从而提高系统性能.为解决拜占庭容错系统的问题,提出了一种基于在网计算的拜占庭容忍共识算法优化方案,将算法的一部分处理任务卸载到网卡上执行,利用网卡和处理器形成的多级流水线提升系统吞吐量.由于仅使用在网计算的方案在特定场景下效果不佳,因此,使用多线程方法来提升优化方案的可扩展性.同时,对算法进行了详细的系统评测,实验结果表明:相对于普通的拜占庭容错系统,使用在网计算与多线程结合的优化方案能够获得46%的吞吐率提升以及65%的延迟下降,证明了基于在网计算的拜占庭容忍共识算法优化方案的可行性与有效性.

关键词: 分布式系统, 拜占庭容错算法, 在网计算, 加速器, 高性能计算

Abstract: Byzantine fault tolerance algorithm is one kind of fault-tolerant algorithms which can tolerate various software errors and system vulnerabilities. It is of vital importance to the reliability of cloud computing. Compared with other fault-tolerant algorithms, such as proof-of-work (PoW), Byzantine fault tolerance algorithm is much more stable, however, its poor performance cannot meet the demand of cloud computing which requires high throughput and low latency. In-network computing is a data-centric architecture that uses the network to perform some calculations. Using in-network computing, data can be processed as it moves, thereby improving system performance. To solve the performance problem of Byzantine fault tolerant system, in this paper, we propose a Byzantine fault tolerance algorithm optimization strategy with in-network computing, which offloads some of the computational tasks to the network interface card (NIC). The processor and NIC form a multi-stage pipeline which helps us improve the system throughput. Simply using in-network computing can not meet the performance goals in all scenarios, hence we utilize multi-threading technology to scale the system. We evaluate our method on real testbed, and the experimental results show that, compared with the default Byzantine fault tolerant system, we can obtain 46% improvement in overall throughput and 65% decrease in latency. The results have proved our solution to be available and effective.

Key words: distributed system, Byzantine fault tolerant algorithms, in-network computing, accelerator, high performance computing

中图分类号: