ISSN 1000-1239 CN 11-1777/TP

• Artificial Intelligence •

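### High-Dimensional Data Stream Clustering Based on Random Projection

• 1(College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106); 2(MIIT Key Laboratory of Pattern Analysis and Machine Intelligence (Nanjing University of Aeronautics and Astronautics), Nanjing 211106); 3(College of Computer Science and Engineering, Sanjiang University, Nanjing 210012) (yingwen.zhu@nuaa.edu.cn)
• Published: 2020-08-01
• Funding:
Key Program of the National Natural Science Foundation of China (61732006)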

### High Dimensional Data Stream Clustering Algorithm Based on Random Projection

Zhu Yingwen (1,2,3), Chen Songcan (1,2)

• 1(College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106); 2(MIIT Key Laboratory of Pattern Analysis and Machine Intelligence (Nanjing University of Aeronautics and Astronautics), Nanjing 211106); 3(College of Computer Science and Engineering, Sanjiang University, Nanjing 210012)
• Online: 2020-08-01
• Supported by:
This work was supported by the Key Program of National Natural Science Foundation of China (61732006).

Abstract: High-dimensional data streams arise in many real-world applications, such as network monitoring. Clustering such streams differs from traditional clustering, where the dataset is generally static and can be read and processed repeatedly; a stream clustering algorithm must instead satisfy constraints such as bounded memory, single-pass processing, real-time response, and concept-drift detection. Many stream clustering methods have been proposed recently. However, on high-dimensional data they often incur high computational cost and perform poorly because of the curse of dimensionality. To address this problem, we present a new clustering algorithm for data streams, called RPFART, which combines random projection with the adaptive resonance theory (ART) model. ART has linear computational complexity, uses a single parameter (the vigilance parameter) to identify clusters, and is robust to moderate variations in parameter setting. To explain the performance improvement obtained by our algorithm, we analyze and identify the major influence of random projection on ART. Although our method is very simple, merely incorporating random projection into ART, experimental results on a variety of benchmark datasets indicate that it achieves comparable or even better performance than the RPGStream algorithm, even when the dimension is compressed to as little as 10% of the original. For the ACT1 dataset, for example, the dimension is reduced from 67500 to 6750.
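The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the Fuzzy ART variant with complement coding (which the name RPFART suggests but the abstract does not state), a Gaussian projection matrix, and a logistic squashing step to bring projected features into (0, 1). All names (`make_projection`, `FuzzyART`, `rp_art_stream`) and the default parameter values are hypothetical.

```python
import numpy as np


def make_projection(d_in, d_out, rng):
    """Gaussian random projection matrix (assumed form; entries ~ N(0, 1/d_out))."""
    return rng.normal(0.0, 1.0 / np.sqrt(d_out), size=(d_in, d_out))


class FuzzyART:
    """Minimal single-pass Fuzzy ART clusterer with complement coding."""

    def __init__(self, vigilance=0.75, alpha=1e-3, beta=1.0):
        self.rho = vigilance   # vigilance: higher values give more, tighter clusters
        self.alpha = alpha     # small choice parameter (assumed value)
        self.beta = beta       # learning rate; 1.0 is "fast learning"
        self.weights = []      # one weight vector per cluster

    def partial_fit(self, x):
        """Process one sample with features in [0, 1]; return its cluster index."""
        i = np.concatenate([x, 1.0 - x])                # complement coding
        if self.weights:
            w = np.vstack(self.weights)
            overlap = np.minimum(i, w).sum(axis=1)      # L1 norm of fuzzy AND
            choice = overlap / (self.alpha + w.sum(axis=1))
            match = overlap / i.sum()
            for j in np.argsort(-choice):               # try best category first
                if match[j] >= self.rho:                # vigilance test passed
                    self.weights[j] = (self.beta * np.minimum(i, w[j])
                                       + (1.0 - self.beta) * w[j])
                    return int(j)
        self.weights.append(i.copy())                   # no resonance: new cluster
        return len(self.weights) - 1


def rp_art_stream(X, d_out, vigilance=0.75, seed=0):
    """Cluster the rows of X in one pass after projecting them to d_out dimensions."""
    rng = np.random.default_rng(seed)
    R = make_projection(X.shape[1], d_out, rng)
    art = FuzzyART(vigilance=vigilance)
    labels = []
    for x in X:                          # one sample at a time, never revisited
        z = x @ R                        # compress to d_out dimensions
        u = 1.0 / (1.0 + np.exp(-z))     # squash into (0, 1); an assumption
        labels.append(art.partial_fit(u))
    return np.array(labels), art
```

Calling `rp_art_stream(X, d_out=X.shape[1] // 10)` mirrors the 10% compression ratio mentioned in the abstract. The projection matrix is drawn once and reused for every sample, so per-sample cost depends on the reduced dimension and the current number of clusters rather than on the stream length.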