ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2015, Vol. 52 ›› Issue (2): 333-342.doi: 10.7544/issn1000-1239.2015.20140302

所属专题: 2015大数据管理

• 综述 • 上一篇    下一篇

大数据分析与高速数据更新

陈世敏   

  1. (计算机体系结构国家重点实验室(中国科学院计算技术研究所) 北京 100190) (chensm@ict.ac.cn)
  • 出版日期: 2015-02-01
  • 基金资助: 
    基金项目:中国科学院“百人计划”;国家自然科学基金创新研究群体项目(61221062)

Big Data Analysis and Data Velocity

Chen Shimin   

  1. (State Key Laboratory of Computer Architecture(Institute of Computing Technology, Chinese Academy of Sciences), Beijing 100190)
  • Online: 2015-02-01

摘要: 大数据对于数据管理系统平台的主要挑战可以归纳为volume(数据量大)、velocity(数据的产生、获取和更新速度快)和variety(数据种类繁多)3个方面.针对大数据分析系统,尝试解读velocity的重要性和探讨如何应对velocity的挑战.首先比较事物处理、数据流、与数据分析系统对velocity的不同要求.然后从数据更新与大数据分析系统相互关系的角度出发,讨论两项近期的研究工作:1)MaSM,在数据仓库系统中支持在线数据更新;2)LogKV,在日志处理系统中支持高速流入的日志数据和高效的基于时间窗口的连接操作.通过分析比较发现,存储数据更新只是最基本的要求,更重要的是应该把大数据的从更新到分析作为数据的整个生命周期,进行综合考虑和优化,根据大数据分析的特点,优化高速数据更新的数据组织和数据分布方式,从而保证甚至提高数据分析运算的效率.

关键词: 数据更新, 大数据分析, 数据仓库, 日志处理系统, 数据组织与分布算法

Abstract: Big data poses three main challenges to the underlying data management systems: volume (a huge amount of data), velocity (high speed of data generation, data acquisition, and data updates), and variety (a large number of data types and data formats). In this paper, we focus on understanding the significance of velocity and discussing how to face the challenge of velocity in the context of big data analysis systems. We compare the requirements of velocity in transaction processing, data stream, and data analysis systems. Then we describe two of our recent research studies with an emphasis on the role of data velocity in big data analysis systems: 1) MaSM, supporting online data updates in data warehouse systems; 2) LogKV, supporting high-throughput data ingestion and efficient time-window based joins in an event log processing system. Comparing the two studies, we find that storing incoming data updates is only the minimum requirement. We should consider velocity as an integral part of the data acquisition and analysis life cycle. It is important to analyze the characteristics of the desired big data analysis operations, and then to optimize data organization and data distribution schemes for incoming data updates so as to maintain or even improve the efficiency of big data analysis.

Key words: data updates, big data analysis, data warehouse, event log processing system, data organization and distribution

中图分类号: