    Citation: Shen Guowei, Yang Wu, Wang Wei, Yu Miao, Dong Guozhong. Large-Scale Heterogeneous Data Co-Clustering Based on Nonnegative Matrix Factorization[J]. Journal of Computer Research and Development, 2016, 53(2): 459-466. DOI: 10.7544/issn1000-1239.2016.20148284

    Large-Scale Heterogeneous Data Co-Clustering Based on Nonnegative Matrix Factorization

    • Abstract: A heterogeneous information network contains multiple types of entities and relations. Several co-clustering algorithms have been proposed to mine the underlying structure of the different entity types. However, as the data scale grows, the entity types grow at unbalanced rates and the heterogeneous relational data become extremely sparse, which leads to high time complexity and low accuracy in clustering. To address this problem, we propose FNMTF-CM, a two-stage co-clustering algorithm based on correlation matrix factorization. In the first stage, a correlation matrix is built from the correlations among the entities of the smaller type and is decomposed by symmetric nonnegative matrix factorization to obtain the partition indicator matrix of that type. Compared with the original heterogeneous relation matrix, the correlation matrix is denser and smaller, so the algorithm can process large-scale heterogeneous data while maintaining high precision. In the second stage, this indicator matrix is used directly as an input to the tri-factorization of the heterogeneous relation matrix, so the indicator matrix of the other entity type can be solved quickly. Experiments on artificial and real-world heterogeneous data sets show that the accuracy and performance of FNMTF-CM are overall superior to those of traditional co-clustering algorithms based on nonnegative matrix factorization.
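The two stages described in the abstract can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how the correlation matrix is constructed or which update rules FNMTF-CM uses. The sketch assumes the correlation matrix is the co-occurrence product `R @ R.T` over the smaller entity type, uses the standard damped multiplicative update for symmetric NMF, and alternates a least-squares solve for the middle factor with nearest-prototype assignment in the second stage. It also assumes the same number of clusters `k` for both entity types. All function names (`symnmf`, `one_hot`, `two_stage_coclustering`) are hypothetical.

```python
import numpy as np


def symnmf(A, k, n_iter=500, seed=0, eps=1e-12):
    """Approximate A ~ H @ H.T with H >= 0 (damped multiplicative updates)."""
    rng = np.random.default_rng(seed)
    # Random positive start, scaled so that H @ H.T is on the scale of A.
    H = rng.random((A.shape[0], k)) * 2.0 * np.sqrt(A.mean() / k)
    for _ in range(n_iter):
        H *= 0.5 + 0.5 * (A @ H) / (H @ (H.T @ H) + eps)
    return H


def one_hot(labels, k):
    """Turn integer labels into a 0/1 partition indicator matrix."""
    return np.eye(k)[labels]


def two_stage_coclustering(R, k, n_iter=50):
    """R is an (m x n) nonnegative relation matrix with m << n assumed."""
    # Stage 1: cluster the smaller entity type (rows) on its own.
    A = R @ R.T                      # assumed correlation matrix, m x m
    H = symnmf(A, k)
    row_labels = H.argmax(axis=1)
    F = one_hot(row_labels, k)       # fixed row indicator matrix, m x k

    # Stage 2: with F fixed, solve R ~ F @ S @ G.T for S and G only.
    Z = np.linalg.pinv(F) @ R        # mean link weight per row cluster, k x n
    col_labels = Z.argmax(axis=0)    # assumed initialisation of G
    for _ in range(n_iter):
        G = one_hot(col_labels, k)
        S = Z @ np.linalg.pinv(G.T)  # least-squares middle factor, k x k
        P = F @ S                    # one prototype column per column cluster
        # Squared distance of each column of R to each prototype
        # (the constant ||R[:, j]||^2 term is dropped).
        dist = (P ** 2).sum(axis=0) - 2.0 * (R.T @ P)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, col_labels):
            break
        col_labels = new_labels
    return row_labels, col_labels


# Toy relation matrix: two row groups linked to two disjoint column groups.
R = np.kron(np.eye(2), np.ones((3, 4)))
rows, cols = two_stage_coclustering(R, k=2)
print(rows, cols)
```

Because the symmetric factorization runs only on the small `m x m` matrix and the row indicator is then held fixed, the second stage touches the large relation matrix only through matrix products, which matches the cost argument made in the abstract. Cluster label numbering is arbitrary, so only the grouping is meaningful.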

       
