ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2017, Vol. 54 ›› Issue (2): 248-257. doi: 10.7544/issn1000-1239.2017.20170005

Special topic: 2017 Special Issue on Scientific Big Data Management

• Software Technology •

Astronomical Big Data Challenges and Real-Time Processing Technologies

杨晨1,翁祖建1,孟小峰1,任玮1,忻日辉1,王春凯1,都志辉2,万萌3,魏建彦3   

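  1. 1(School of Information, Renmin University of China, Beijing 100872); 2(Department of Computer Science and Technology, Tsinghua University, Beijing 100084); 3(National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012) (yang_chen@ruc.edu.cn)
  • Published: 2017-02-01
  • Supported by: 
    National Key Research and Development Program of China (2016YFB1000602)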

Data Management Challenges and Real-Time Processing Technologies in Astronomy

Yang Chen1, Weng Zujian1, Meng Xiaofeng1, Ren Wei1, Xin Rihui1, Wang Chunkai1, Du Zhihui2, Wan Meng3, Wei Jianyan3   

  1. 1(School of Information, Renmin University of China, Beijing 100872); 2(Department of Computer Science and Technology, Tsinghua University, Beijing 100084); 3(National Astronomical Observatories, Chinese Academy of Sciences, Beijing 100012)
  • Online: 2017-02-01

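Abstract (Chinese version): The emergence of very large astronomical observation facilities not only lets researchers observe new astronomical phenomena but can also be used to verify existing physical models. These discoveries depend on generating, managing, and analyzing massive volumes of astronomical data in near real time, which poses new challenges for current data management systems. Take the ground-based wide-angle camera array (GWAC), a telescope developed independently in China, as an example: its 15 s sampling and processing cycle is among the world's leading in short-timescale observation, yet it raises many problems for the data management system, including managing data output in parallel by multiple cameras, real-time discovery of transient sources, second-level queries over the data of the current observation night, data persistence, and fast offline queries. To address these problems, we design a distributed GWAC data simulator that mimics the real GWAC data-production scenario. Based on the characteristics of the generated data, we propose a two-level cache architecture that uses local memory to handle parallel multi-camera output and real-time transient discovery, and uses distributed shared memory to achieve second-level queries. To balance persistence and query efficiency, we design a star cluster structure that partitions the whole star table data and stores each partition together. Based on the characteristics of astronomical requirements, we design an index-table-based query engine that can query star table data from the cache and the star clusters at low cost. Experiments verify that the current scheme meets the requirements of GWAC.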

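Key words (Chinese version): astronomical big data management, ground-based wide-angle camera array, two-level cache, star cluster, index table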

Abstract: In recent years, many large telescopes capable of producing petabytes or even exabytes of data have come into operation. These telescopes not only help discover new astronomical phenomena but also help confirm existing astrophysical models. However, the star tables they produce are so large that a single database cannot manage them efficiently. Take GWAC, a 40-camera array designed in China, as an example: it takes high-resolution images every 15 s, and its database must make each star table queryable within 15 s. Moreover, the database has to process multi-camera data, find abnormal stars in real time, query their recent historical data very quickly, and persist star tables and query them offline as fast as possible. To address these problems, we first design a distributed data generator to simulate the GWAC working process. Second, we propose a two-level cache architecture that not only processes multi-camera data and finds abnormal stars in local memory, but also queries star tables in a distributed memory system. Third, we propose a storage format named star cluster, which stores a group of stars in one physical file to trade off persistence efficiency against query efficiency. Finally, our query engine, based on an index table, can query both the second-level cache and the star cluster format. The experimental results show that our distributed system prototype can satisfy the demands of GWAC on our server cluster.

Key words: astronomy big data management, the ground-based wide-angle camera array (GWAC), two-level cache, star cluster, index table
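
The architecture summarized in the abstract (a local first-level cache for multi-camera ingestion and abnormal-star detection, a shared second-level cache for recent-history queries, and an index table over star clusters) can be illustrated with a toy, single-process sketch. Everything here is an assumption made for illustration: the class and function names, the magnitude-jump threshold used as the "abnormal star" test, and the fixed-size grouping of stars into clusters are hypothetical and are not taken from the paper, which uses distributed memory and physical files rather than in-process dictionaries.

```python
from dataclasses import dataclass, field


@dataclass
class TwoLevelCache:
    """Toy model: level 1 is per-camera local state, level 2 a shared history store."""
    threshold: float = 1.0                      # hypothetical magnitude-jump threshold
    local: dict = field(default_factory=dict)   # level 1: (camera, star) -> last magnitude
    shared: dict = field(default_factory=dict)  # level 2: star -> [(timestamp, magnitude)]

    def ingest(self, camera_id, timestamp, star_table):
        """Process one star table (star_id -> magnitude) from one camera.

        Returns the star ids whose magnitude changed by more than the
        threshold since the previous table from the same camera.
        """
        abnormal = []
        for star_id, mag in star_table.items():
            key = (camera_id, star_id)
            prev = self.local.get(key)
            if prev is not None and abs(mag - prev) > self.threshold:
                abnormal.append(star_id)
            self.local[key] = mag
            self.shared.setdefault(star_id, []).append((timestamp, mag))
        return abnormal

    def recent(self, star_id, since):
        """Query the second-level cache for observations at or after `since`."""
        return [(t, m) for t, m in self.shared.get(star_id, []) if t >= since]


def build_star_clusters(shared, stars_per_cluster):
    """Group stars into fixed-size clusters and build an index table.

    Returns (clusters, index): clusters maps cluster number -> {star_id: history},
    and index maps star_id -> cluster number, so a lookup touches one cluster.
    """
    clusters, index = {}, {}
    for pos, star_id in enumerate(sorted(shared)):
        cluster_no = pos // stars_per_cluster
        clusters.setdefault(cluster_no, {})[star_id] = list(shared[star_id])
        index[star_id] = cluster_no
    return clusters, index
```

In this sketch, a query for one star first consults `recent` for the current night and otherwise uses `index` to open only the cluster that holds the star, which mirrors the low-cost lookup the abstract attributes to the index table.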

CLC Number: