高级检索

    高能物理大数据挑战与海量事例特征索引技术研究

    Data Management Challenges and Event Index Technologies in High Energy Physics

    • 摘要: 新一代高能物理实验装置的建成与运行,产生了PB乃至EB量级的数据,这对数据采集、存储、传输与共享、分析与处理等数据管理技术提出了巨大挑战.事例是高能物理实验的基本数据单元,一次大型实验即可产生万亿级的事例.传统高能物理数据处理以ROOT文件为基本存储和处理单位,每个ROOT文件可以包含数千至数亿个事例.这种基于文件的处理方式虽然降低了高能物理数据管理系统的开发难度,但物理分析仅对极少量的稀有事例感兴趣,这导致了数据传输量大、I/O瓶颈以及数据处理效率低等问题.提出一种面向事例的高能物理数据管理方法,重点研究海量事例特征高效索引技术.在这种方法中,将物理学家感兴趣的事例的特征量抽取出来建立专门的索引,存储在NoSQL数据库中.为便于物理分析处理,事例的原始数据仍然存放在ROOT文件中.最后,通过系统验证和分析表明,基于事例特征索引进行事例筛选是可行的,优化后的HBase系统可以满足事例索引的需求.

       

      Abstract: Nowadays, more and more scientific data has been produced by new generation high energy physics facilities. The scale of the data can be achieved to PB or EB level even by one experiment, which brings big challenges to data management technologies such as data acquisition, storage, transmission,sharing, analyzing and processing. Event is the basic data unit of high energy physics, and one large high energy physics experiment can produce trillions of events. The traditional high energy physical data processing technology adopts file as a basic data management unit, and each file contains thousands of events. The benefit of file-based method is to simplify the complexity of data management system. However, one physical analysis task is only interested in very few events, which leads to some problems including transferring too much redundant data, I/O bottleneck and low efficiency of data processing. To solve these problems, this paper proposes an event-oriented high energy physical data management method, which focuses on high efficiency indexing technology of massive events. In this method, event data is still stored in ROOT file while a large amount of events are indexed by some specified properties and stored in NoSQL database. Finally,experimental test results show the feasibility of the method, and optimized HBase system can meet the requirements of event index.

       

    /

    返回文章
    返回