Self-Adaptive Streaming Big Data Learning Algorithm Based on Incremental Tangent Space Alignment

Tan Chao (谈超); Ji Genlin (吉根林); Zhao Bin (赵斌)

doi:10.7544/issn1000-1239.2017.20160712

Abstract

Manifold learning seeks a low-dimensional embedding of data observed in a high-dimensional space. As an effective nonlinear dimensionality reduction method, it has been widely applied in machine learning fields such as data mining and pattern recognition. However, existing manifold learning methods, including out-of-sample, incremental, and online learning algorithms, have low time efficiency when processing large-scale streaming data. This paper proposes a new self-adaptive streaming big data learning algorithm based on incremental tangent space alignment (SLITSA). SLITSA adopts the idea of incremental PCA to construct the subspace incrementally, and it can detect the intrinsic low-dimensional manifold structure of a data stream online or incrementally. During iteration, it constructs new tangent spaces for alignment, which ensures the convergence of the algorithm and reduces the reconstruction error. Experiments on artificial and real data sets show that the proposed algorithm outperforms the other learning algorithms compared in both classification accuracy and time efficiency, and that it can be extended to online and streaming big data applications.
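The abstract does not spell out how the subspace is updated, so the following is only a minimal sketch of the general incremental PCA idea it refers to, not the SLITSA update itself. It assumes a running mean and scatter matrix updated one sample at a time, followed by power iteration with deflation to read off the leading directions. The class name `StreamingPCA` and all of its methods and parameters are hypothetical and chosen for illustration.

```python
import math
import random


class StreamingPCA:
    """Running-covariance PCA sketch (hypothetical; not the SLITSA method)."""

    def __init__(self, n_features, n_components):
        self.n = 0                      # number of samples seen so far
        self.d = n_features
        self.k = n_components
        self.mean = [0.0] * n_features
        # Sum of outer products of deviations (covariance before scaling).
        self.scatter = [[0.0] * n_features for _ in range(n_features)]

    def partial_fit(self, x):
        """Fold one new sample into the running mean and scatter matrix."""
        self.n += 1
        delta_old = [xi - mi for xi, mi in zip(x, self.mean)]
        self.mean = [mi + di / self.n for mi, di in zip(self.mean, delta_old)]
        delta_new = [xi - mi for xi, mi in zip(x, self.mean)]
        for i in range(self.d):
            for j in range(self.d):
                self.scatter[i][j] += delta_old[i] * delta_new[j]
        return self

    def covariance(self):
        """Sample covariance of everything seen so far."""
        denom = max(self.n - 1, 1)
        return [[s / denom for s in row] for row in self.scatter]

    def components(self, n_iter=200, seed=0):
        """Leading directions via power iteration with deflation."""
        rng = random.Random(seed)
        cov = self.covariance()
        basis = []
        for _ in range(self.k):
            v = [rng.gauss(0.0, 1.0) for _ in range(self.d)]
            for _ in range(n_iter):
                v = [sum(c * vj for c, vj in zip(row, v)) for row in cov]
                # Remove the parts along directions already found.
                for u in basis:
                    dot = sum(a * b for a, b in zip(u, v))
                    v = [a - dot * b for a, b in zip(v, u)]
                norm = math.sqrt(sum(a * a for a in v))
                if norm < 1e-12:
                    return basis        # no variance left to explain
                v = [a / norm for a in v]
            basis.append(v)
        return basis

    def transform(self, x):
        """Project one sample onto the current subspace."""
        centered = [xi - mi for xi, mi in zip(x, self.mean)]
        return [sum(a * b for a, b in zip(u, centered)) for u in self.components()]
```

Each call to `partial_fit` costs a fixed amount of work per sample and never revisits earlier data, which is the property that makes this style of update a natural fit for streams. In a tangent-space method, a subspace of this kind would typically be estimated per local neighborhood rather than globally; that step is left out here.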
