Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (1): 102-123. doi: 10.7544/issn1000-1239.2020.20180675

Reliability in Cloud Computing System: A Review

Duan Wenxue1, Hu Ming1, Zhou Qiong2, Wu Tingming1, Zhou Junlong3, Liu Xiao4, Wei Tongquan1, Chen Mingsong1   

  1(Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062); 2(School of Economics and Finance, Shanghai International Studies University, Shanghai 200083); 3(School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094); 4(School of Information Technology, Deakin University, Melbourne, VIC 3125, Australia)
  • Online: 2020-01-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFB2101300) and the National Natural Science Foundation of China (61872147).

Abstract: As a new computing paradigm, cloud computing has attracted extensive attention from both academia and industry. Built on resource virtualization technology, cloud computing provides users with infrastructure, platform, and software services in a "pay-as-you-go" manner. Because it also offers highly scalable computing resources, more and more enterprises and organizations choose cloud platforms to deploy their scientific or commercial applications. However, as the number of cloud users grows, cloud data centers keep expanding and their architectures become increasingly complex, so runtime failures in cloud computing systems occur more and more frequently. Ensuring reliability in large-scale cloud computing systems with complex architectures has therefore become a major challenge. This paper first summarizes the various failures that occur in cloud systems, introduces several methods for evaluating the reliability of cloud computing, and describes the key fault management mechanisms. Since fault management techniques inevitably increase the energy consumption of cloud systems, the paper then reviews current research on the trade-off between reliability and energy efficiency in cloud computing. Finally, it outlines the major open challenges in cloud computing reliability research and concludes.

Key words: cloud computing, virtualization, reliability, fault management, energy consumption