
High Dimensional Data Stream Clustering Algorithm Based on Random Projection

Zhu Yingwen, Chen Songcan

Citation: Zhu Yingwen, Chen Songcan. High Dimensional Data Stream Clustering Algorithm Based on Random Projection[J]. Journal of Computer Research and Development, 2020, 57(8): 1683-1696. DOI: 10.7544/issn1000-1239.2020.20200432. CSTR: 32373.14.issn1000-1239.2020.20200432

  • CLC number: TP391

Funds: This work was supported by the Key Program of the National Natural Science Foundation of China (61732006).
  • Abstract: High-dimensional data streams arise in many real-world applications, such as network monitoring. Unlike traditional clustering, where the dataset is static and can be read and processed repeatedly, data stream clustering must satisfy constraints such as bounded memory, single-pass processing, and real-time response, and must cope with concept drift. Many data stream clustering methods have been proposed recently, but on high-dimensional data they often incur high computational cost and perform poorly because of the curse of dimensionality. To address this problem, we present RPFART, an efficient clustering algorithm for high-dimensional data streams that combines random projection with the adaptive resonance theory (ART) model. The algorithm has linear computational complexity, uses a single hyperparameter (the vigilance parameter) to identify clusters, and is robust to the parameter setting. To explain the performance gain, we analyze the main influence of random projection on ART. Although the method simply incorporates random projection into ART, experimental results on a variety of benchmark datasets show that it achieves performance comparable to, or even better than, the RPGStream algorithm even when the dimension is compressed to 10% of the original. For the ACT1 dataset, for example, the dimension is reduced from 67500 to 6750.
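To make the pipeline in the abstract concrete, the following is a minimal Python sketch of the general idea: compress each incoming point with a random projection (here to 10% of the original dimension, the same ratio as 67500 to 6750), then cluster it in a single pass with a vigilance test. This is not the authors' RPFART. The ART model is replaced by a simplified leader-style clusterer whose similarity function, `1 / (1 + distance)`, is an assumption made for illustration, and all names (`make_projection`, `project`, `VigilanceClusterer`, `run_demo`) are hypothetical. It uses only the standard library so that it stays self-contained; an array library would be the natural choice at realistic dimensions.

```python
import math
import random


def make_projection(d_in, d_out, seed=0):
    """Gaussian random projection matrix with entries drawn from N(0, 1/d_out)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(d_out)
    return [[rng.gauss(0.0, scale) for _ in range(d_in)] for _ in range(d_out)]


def project(matrix, x):
    """Map a point of length d_in to the compressed space of length d_out."""
    return [sum(r * v for r, v in zip(row, x)) for row in matrix]


class VigilanceClusterer:
    """Single-pass clustering gated by a vigilance threshold.

    A simplified stand-in for ART (assumption): the similarity measure and
    the running-mean update are illustrative, not taken from the paper.
    """

    def __init__(self, vigilance):
        self.vigilance = vigilance  # the only hyperparameter in this sketch
        self.prototypes = []        # one prototype vector per cluster
        self.counts = []            # number of points absorbed per cluster

    def partial_fit(self, z):
        """Assign z to the best-matching cluster, or open a new one."""
        best, best_sim = None, -1.0
        for j, w in enumerate(self.prototypes):
            sim = 1.0 / (1.0 + math.dist(z, w))  # similarity in (0, 1]
            if sim > best_sim:
                best, best_sim = j, sim
        if best is None or best_sim < self.vigilance:
            # No existing cluster passes the vigilance test.
            self.prototypes.append(list(z))
            self.counts.append(1)
            return len(self.prototypes) - 1
        # Move the winning prototype toward z (running mean of its members).
        self.counts[best] += 1
        lr = 1.0 / self.counts[best]
        w = self.prototypes[best]
        self.prototypes[best] = [wi + lr * (zi - wi) for wi, zi in zip(w, z)]
        return best


def run_demo(seed=1):
    """Cluster a toy stream of two well-separated groups after 10x compression."""
    rng = random.Random(seed)
    d_in, d_out = 200, 20  # keep 10% of the original dimensions
    matrix = make_projection(d_in, d_out, seed=seed)
    model = VigilanceClusterer(vigilance=0.1)
    labels = []
    for t in range(40):
        center = 0.0 if t % 2 == 0 else 5.0  # alternate between the two groups
        x = [center + rng.gauss(0.0, 0.1) for _ in range(d_in)]
        labels.append(model.partial_fit(project(matrix, x)))
    return labels


if __name__ == "__main__":
    print(run_demo())
```

Each point is seen once and compared only against the current prototypes, so the per-point cost in this sketch grows linearly with the compressed dimension and the number of clusters. The projection matrix is generated once and reused for the whole stream, which is what keeps distances between points comparable over time.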
Publication history
  • Publication date: 2020-07-31
