Research on a Big Data Platform Based on the Open Source Ecosystem

雷军; 叶航军; 武泽胜; 张鹏; 谢龙; 何炎祥

doi:10.7544/issn1000-1239.2017.20150492

Big-Data Platform Based on Open Source Ecosystem

Abstract

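Abstract: The collection and processing of large-scale data has been a research hotspot in recent years. Industry has proposed several platform-level designs that make heavy use of open source software as data collection and processing components. However, to truly meet the complex requirements of enterprise applications, such as massive data storage, diversified business processing, cross-business analysis, and cross-environment deployment, a big data platform is still needed that is complete, general-purpose, and supports management of the entire data life cycle, and the open source software itself requires substantial feature development, customization, and improvement. Drawing on the industrial applications and practice of Xiaomi Inc. and on an in-depth study of existing platforms, this paper proposes a new big data collection and processing platform based on the open source ecosystem. The platform incorporates extensive optimizations in load balancing, failure recovery, data compression, and multi-dimensional scheduling, and the work also identifies and resolves defects of existing open source software in data collection, storage, and processing, as well as in software consistency, availability, and efficiency. The platform has been successfully deployed at Xiaomi Inc., where it supports data collection and processing for each of the company's business lines.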

Abstract: Large-scale data collection and processing have been widely studied in recent years, and several released big data processing platforms play increasingly important roles in the operations of many Internet businesses. Open source ecosystems, the engine of big data innovation, have evolved so rapidly that a number of open source projects have been adopted as components of mainstream data processing platforms. In practice, however, open source software is still far from perfect when dealing with real large-scale data. On the basis of industrial practice at Xiaomi Inc., this paper proposes an improved platform for collecting and processing large-scale data in the face of varied business requirements. We focus on problems in the functionality, consistency, and availability of the software when it is used for data collection, storage, and processing. In addition, we propose a series of optimizations for load balancing, failover, data compression, and multi-dimensional scheduling that significantly improve the efficiency of the current system. All the designs and optimizations described in this paper have been implemented and deployed to support various Internet services provided by Xiaomi Inc.
