ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展), 2017, Vol. 54, Issue 11: 2547-2557. doi: 10.7544/issn1000-1239.2017.20160712

• Artificial Intelligence •

Self-Adaptive Streaming Big Data Learning Algorithm Based on Incremental Tangent Space Alignment

Original Chinese title: 基于增量切空间校准的自适应流式大数据学习算法

Tan Chao (谈超), Ji Genlin (吉根林), Zhao Bin (赵斌)

  1. (School of Computer Science and Technology, Nanjing Normal University, Nanjing 210023) (tutu_tanchao@163.com)
  • Published online: 2017-11-01
  • Supported by: 
    National Natural Science Foundation of China (41471371, 61702270); Natural Science Foundation of the Jiangsu Higher Education Institutions (15KJB520022)

Abstract: Manifold learning seeks low-dimensional embeddings of data observed in a high-dimensional space. As an effective nonlinear dimensionality reduction method, it has been widely applied in machine learning fields such as data mining and pattern recognition. However, many existing manifold learning methods, including out-of-sample, incremental, and online learning algorithms, have low time efficiency when processing large-scale data streams. This paper presents a novel self-adaptive streaming big data learning algorithm based on incremental tangent space alignment (SLITSA). SLITSA adopts the idea of incremental PCA to construct the subspace incrementally, so it can detect the intrinsic low-dimensional manifold structure of a data stream online or incrementally. During the iterative process it constructs new tangent spaces for alignment, which ensures convergence of the algorithm and reduces the reconstruction error. Experiments on artificial and real data sets show that the proposed algorithm achieves better classification accuracy and time efficiency than other learning algorithms, and that it can be extended to online or streaming big data applications.

Key words: manifold learning, nonlinear dimensionality reduction, big data streams, incremental tangent space, self-adaptive
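
The abstract mentions that the subspace is built incrementally in the spirit of incremental PCA. As a rough illustration of that general idea only, the sketch below keeps a running mean and scatter matrix that are updated one batch at a time, and recovers a low-dimensional linear subspace from them on demand. It is not the SLITSA algorithm and contains no tangent space alignment step; the class name `StreamingPCA`, its methods, and the batch size are hypothetical choices made for this example.

```python
import numpy as np


class StreamingPCA:
    """Running-mean / running-scatter PCA (illustrative helper, not SLITSA)."""

    def __init__(self, n_features, n_components):
        self.n = 0                                    # samples seen so far
        self.mean = np.zeros(n_features)              # running mean
        self.scatter = np.zeros((n_features, n_features))  # sum of centered outer products
        self.n_components = n_components

    def partial_fit(self, batch):
        """Fold one batch of rows into the running statistics."""
        batch = np.atleast_2d(np.asarray(batch, dtype=float))
        m = batch.shape[0]
        batch_mean = batch.mean(axis=0)
        centered = batch - batch_mean
        delta = batch_mean - self.mean
        total = self.n + m
        # Pairwise merge of two scatter matrices (old data and new batch).
        self.scatter += centered.T @ centered + np.outer(delta, delta) * self.n * m / total
        self.mean += delta * m / total
        self.n = total
        return self

    def components(self):
        """Top principal directions as rows, from the current covariance estimate."""
        cov = self.scatter / max(self.n - 1, 1)
        eigvals, eigvecs = np.linalg.eigh(cov)        # eigenvalues in ascending order
        order = np.argsort(eigvals)[::-1][: self.n_components]
        return eigvecs[:, order].T

    def transform(self, X):
        """Project rows of X onto the current subspace."""
        return (np.asarray(X, dtype=float) - self.mean) @ self.components().T


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    stream = rng.normal(size=(200, 5))
    model = StreamingPCA(n_features=5, n_components=2)
    for start in range(0, len(stream), 25):           # simulate data arriving in batches
        model.partial_fit(stream[start:start + 25])
    print(model.transform(stream).shape)
```

Because only the mean and a `n_features × n_features` scatter matrix are stored, memory use does not grow with the length of the stream; the cost of this simple variant is an eigendecomposition each time the subspace is requested.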

CLC number: