ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2021, Vol. 58, Issue (1): 164-177. doi: 10.7544/issn1000-1239.2021.20190723

Accelerating Byzantine Fault Tolerance with In-Network Computing

Yang Fan1,2, Zhang Peng1,2, Wang Zhan1, Yuan Guojun1, An Xuejun1   

  1 (Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190); 2 (University of Chinese Academy of Sciences, Beijing 100049)
  • Online:2021-01-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFB0204400, 2016YFB0200205), the National Natural Science Foundation of China for Young Scientists (61702484), and the Strategic Priority Research Program of the Chinese Academy of Sciences (class B) (XDB24050100).

Abstract: Byzantine fault tolerance (BFT) algorithms are a class of fault-tolerant algorithms that can tolerate a wide range of software errors and system vulnerabilities, and they are vital to the reliability of cloud computing. Compared with other fault-tolerant algorithms, such as proof-of-work (PoW), BFT algorithms are much more stable; however, their poor performance cannot meet the demands of cloud computing, which requires high throughput and low latency. In-network computing is a data-centric architecture that uses the network to perform some computations. With in-network computing, data can be processed as it moves through the network, thereby improving system performance. To address the performance problem of Byzantine fault-tolerant systems, we propose an optimization strategy for BFT algorithms based on in-network computing, which offloads some computational tasks to the network interface card (NIC). The processor and the NIC form a multi-stage pipeline, which improves system throughput. Because in-network computing alone cannot meet the performance goals in all scenarios, we also use multi-threading to scale the system. We evaluate our method on a real testbed. The experimental results show that, compared with the default Byzantine fault-tolerant system, our method achieves a 46% improvement in overall throughput and a 65% reduction in latency, demonstrating that the solution is feasible and effective.
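
The processor-NIC pipeline mentioned in the abstract can be pictured with a minimal two-stage sketch. This is an illustration only, not the authors' implementation: the abstract does not say which tasks are offloaded, so the digest computation in `nic_stage`, the stage names, and the queue size are all assumptions, and both stages here are ordinary Python threads rather than real NIC hardware.

```python
import hashlib
import queue
import threading

SENTINEL = None  # marks the end of the request stream


def nic_stage(requests, out_q):
    """Hypothetical offloaded stage: attach a SHA-256 digest to each request.

    Assumption: hashing stands in for whatever work the NIC would take over.
    """
    for seq, payload in enumerate(requests):
        digest = hashlib.sha256(payload).hexdigest()
        out_q.put((seq, payload, digest))
    out_q.put(SENTINEL)


def cpu_stage(in_q, results):
    """Hypothetical host stage: consume pre-digested requests in arrival order."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        seq, _payload, digest = item
        results.append((seq, digest))


def run_pipeline(requests):
    """Run both stages concurrently, connected by a bounded queue."""
    q = queue.Queue(maxsize=8)  # bounded queue applies back-pressure between stages
    results = []
    producer = threading.Thread(target=nic_stage, args=(requests, q))
    consumer = threading.Thread(target=cpu_stage, args=(q, results))
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return results
```

Calling `run_pipeline([b"req-0", b"req-1"])` returns one `(sequence number, digest)` pair per request, in order. The sketch shows only the structure: while the second stage handles request i, the first stage can already work on request i+1. It does not reproduce the throughput or latency figures reported above.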

Key words: distributed system, Byzantine fault tolerant algorithms, in-network computing, accelerator, high performance computing
