ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2017, Vol. 54 ›› Issue (1): 80-93. doi: 10.7544/issn1000-1239.2017.20150492

• Artificial Intelligence •

Big-Data Platform Based on Open Source Ecosystem

Lei Jun1,2, Ye Hangjun2, Wu Zesheng2, Zhang Peng2, Xie Long2, He Yanxiang1,3

  1. 1(Computer School, Wuhan University, Wuhan 430072); 2(Xiaomi Inc, Beijing 100085); 3(State Key Laboratory of Software Engineering (Wuhan University), Wuhan 430072) (leijun@xiaomi.com)
  • Published: 2017-01-01
  • Online: 2017-01-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (91118003, 61373039, 61170022).

Abstract: Large-scale data collection and processing have been widely studied in recent years, and industry has proposed several platform-level designs that rely heavily on open source software for their data collection and processing components. In practice, however, open source software is still far from sufficient when dealing with real large-scale data. Meeting the complex requirements of enterprise applications, such as massive data storage, diverse business processing, cross-business analysis, and cross-environment deployment, calls for a big data platform that is complete, general-purpose, and able to manage the entire data lifecycle. It also requires substantial feature development, customization, and improvement of the open source software itself. Drawing on industrial practice at Xiaomi Inc and a study of existing platforms, this paper proposes a new platform, built on the open source ecosystem, for collecting and processing large-scale data under varied business requirements. We identify and resolve deficiencies of existing open source software in data collection, storage, and processing, as well as in its consistency, availability, and efficiency. In addition, we describe a series of optimizations targeting load balancing, failover, data compression, and multi-dimensional scheduling. The platform has been deployed at Xiaomi Inc, where it supports data collection and processing for the company's business lines.

Key words: Hadoop, open source ecosystem, big data, data center, network virtualization

CLC Number: