ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2017, Vol. 54 ›› Issue (2): 258-266. doi: 10.7544/issn1000-1239.2017.20160939

Special topic: 2017 Special Issue on Scientific Big Data Management

• Software Technology •

Data Management Challenges and Event Index Technologies in High Energy Physics

Cheng Yaodong1, Zhang Xiao2, Wang Peijian2, Zha Li3, Hou Di2, Qi Yong2, Ma Can4

  1. 1(Institute of High Energy Physics, Chinese Academy of Sciences, Beijing 100049); 2(Department of Computer Science and Technology, Xi'an Jiaotong University, Xi'an 710049); 3(Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190); 4(Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100093) (chyd@ihep.ac.cn)
  • Published: 2017-02-01
  • Online: 2017-02-01
  • Funding:
    National Key Research and Development Program of China (2016YFB1000604)

Abstract: New-generation high energy physics facilities produce data at the PB or even EB scale, which poses major challenges for data management technologies such as data acquisition, storage, transfer and sharing, analysis, and processing. The event is the basic data unit of a high energy physics experiment, and a single large experiment can produce trillions of events. Traditional high energy physics data processing uses the ROOT file as the basic unit of storage and processing, and each ROOT file can contain from thousands to hundreds of millions of events. This file-based approach simplifies the development of data management systems, but a physics analysis is typically interested in only a very small number of rare events, which leads to large volumes of transferred data, I/O bottlenecks, and low processing efficiency. This paper proposes an event-oriented data management method for high energy physics, focusing on efficient indexing of the features of massive numbers of events. In this method, the feature quantities of the events that physicists are interested in are extracted and used to build a dedicated index stored in a NoSQL database. To facilitate physics analysis, the raw event data remains in ROOT files. Finally, system validation and analysis show that event selection based on the event feature index is feasible and that an optimized HBase system can meet the requirements of event indexing.

Key words: high energy physics, data management, event index, HBase, query optimization
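
The event-oriented idea summarized in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation and it does not use HBase or ROOT: the class `EventIndex`, the helper `group_by_file`, the feature names `n_tracks` and `total_energy`, and the file names are all hypothetical, chosen only to show how a feature index can map a query to (file, entry) pointers while the raw events stay in their files.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import NamedTuple


class EventRef(NamedTuple):
    """Pointer back to a raw event; the event data itself stays in its file."""
    root_file: str  # hypothetical ROOT file name
    entry: int      # hypothetical entry number inside that file


class EventIndex:
    """Toy stand-in for a NoSQL index table keyed by feature value."""

    def __init__(self):
        # feature name -> list of (feature value, EventRef)
        self._rows = defaultdict(list)

    def add(self, ref, features):
        """Record the extracted feature values of one event."""
        for name, value in features.items():
            self._rows[name].append((value, ref))

    def build(self):
        """Sort each feature column by value so range queries can bisect."""
        for rows in self._rows.values():
            rows.sort(key=lambda row: row[0])

    def select(self, name, low, high):
        """Return references to events with low <= feature value <= high."""
        rows = self._rows.get(name, [])
        values = [value for value, _ in rows]
        start = bisect_left(values, low)
        stop = bisect_right(values, high)
        return [ref for _, ref in rows[start:stop]]


def group_by_file(refs):
    """Turn selected references into a per-file list of entries to read."""
    plan = defaultdict(list)
    for ref in refs:
        plan[ref.root_file].append(ref.entry)
    return {name: sorted(entries) for name, entries in plan.items()}


# Illustrative data only: three events spread over two files.
index = EventIndex()
index.add(EventRef("run1.root", 0), {"n_tracks": 2, "total_energy": 3.1})
index.add(EventRef("run1.root", 1), {"n_tracks": 4, "total_energy": 3.7})
index.add(EventRef("run2.root", 0), {"n_tracks": 4, "total_energy": 2.9})
index.build()

# Only the matching entries need to be read from the files afterwards.
plan = group_by_file(index.select("n_tracks", 4, 4))
```

In this sketch the query touches only the index, and the resulting `plan` lists the few entries to fetch from each file, which is the property that would reduce data transfer and I/O compared with scanning whole files.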
