ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2019, Vol. 56 ›› Issue (7): 1567-1577. doi: 10.7544/issn1000-1239.2019.20180792

• Software Technology •

A Massively Parallel Bayesian Approach to Factorization-Based Analysis of Big Time Series Data

Gao Tengfei, Liu Yongyan, Tang Yunbo, Zhang Lei, Chen Dan

  1. (School of Computer Science, Wuhan University, Wuhan 430072) (gaotengfei@whu.edu.cn)
  • Online: 2019-07-01
  • Supported by: National Natural Science Foundation of China (61772380); Innovation Group Project of the Hubei Provincial Natural Science Foundation (2017CFA007)


Abstract: Big time series data record the evolution of complex systems over large temporal and spatial scales, capturing in detail the interactions among different parts of a system. Extracting the latent low-dimensional factors plays a crucial role in examining the overall mechanism of the underlying complex system. Research challenges arise from the lack of a priori knowledge, and most conventional factorization methods cannot adapt to the ultra-high dimensionality and scale of big data. Aiming at this grand challenge, this study develops a massively parallel Bayesian approach (G-BF) to factorization-based analysis of tensors formed by massive time series. The approach relies on a Bayesian algorithm to derive the factor matrices in the absence of a priori information. The algorithm is then mapped to the compute unified device architecture (CUDA) model to update the factor matrices in a massively parallel manner. The proposed approach supports factorization of tensors of arbitrary dimensions. Experimental results indicate that: 1) compared with GPU-accelerated hierarchical alternating least squares (G-HALS), G-BF exhibits much better runtime performance, and its superiority becomes more pronounced as the data scale increases; 2) G-BF scales well in terms of data volume, rank, and dimensionality; 3) when applied to the existing framework for fusing sub-factors (hierarchical-parallel factor analysis, H-PARAFAC), G-BF makes it possible to factorize a huge tensor (up to 10^{11} elements over two nodes) as a whole, a capability two orders of magnitude beyond conventional methods.
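For orientation only: the HALS-family baseline that G-BF is compared against builds on the classical CP (PARAFAC) factorization by alternating least squares. The following is a minimal NumPy sketch of that baseline idea for a 3-way tensor, not the authors' Bayesian or CUDA implementation; all function names here are illustrative.

```python
import numpy as np

def unfold(T, mode):
    """Mode-n unfolding of a 3-way tensor into a matrix."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def khatri_rao(A, B):
    """Column-wise Khatri-Rao product of two factor matrices."""
    R = A.shape[1]
    return np.einsum('ir,jr->ijr', A, B).reshape(-1, R)

def cp_als(T, rank, n_iter=200, seed=0):
    """Rank-`rank` CP factorization of a 3-way tensor by alternating least squares."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, rank)) for s in T.shape)
    for _ in range(n_iter):
        # Update each factor matrix in turn, holding the other two fixed.
        A = unfold(T, 0) @ np.linalg.pinv(khatri_rao(B, C).T)
        B = unfold(T, 1) @ np.linalg.pinv(khatri_rao(A, C).T)
        C = unfold(T, 2) @ np.linalg.pinv(khatri_rao(A, B).T)
    return A, B, C

def reconstruct(A, B, C):
    """Rebuild the tensor from its CP factors."""
    return np.einsum('ir,jr,kr->ijk', A, B, C)
```

G-BF replaces these least-squares updates with Bayesian updates that need no prior information, and parallelizes the per-row factor updates across CUDA threads; the sketch above only shows the factorization problem being solved.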

Key words: Bayesian model, big time series data, tensor factorization, massively parallel computing, compute unified device architecture (CUDA)

CLC number: