ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (2): 459-466. doi: 10.7544/issn1000-1239.2016.20148284

• Software Technology •

Large-Scale Heterogeneous Data Co-Clustering Based on Nonnegative Matrix Factorization

Shen Guowei, Yang Wu, Wang Wei, Yu Miao, Dong Guozhong

  1. (Research Center of Information Security, Harbin Engineering University, Harbin 150001) (shenguowei@hrbeu.edu.cn)
  • Published: 2016-02-01
  • Online: 2016-02-01
  • Supported by:
    the National High Technology Research and Development Program of China (863 Program) (2012AA012802); the National Natural Science Foundation of China (61170242)

Abstract: A heterogeneous information network contains multiple types of entities and the relations among them. Co-clustering algorithms have been proposed to mine the underlying structure of the different entity types. However, as the data scale grows, the entity types grow at unbalanced rates and the heterogeneous relational data become extremely sparse, which gives clustering algorithms high time complexity and low accuracy. To address these problems, we propose FNMTF-CM, a two-stage co-clustering algorithm based on correlation matrix factorization. In the first stage, a correlation matrix is built from the correlations among the entities of the smaller type, and symmetric nonnegative matrix factorization decomposes it into the partition indicator matrix of that type. Compared with the original heterogeneous relation matrix, the correlation matrix is denser and smaller, so the algorithm can process large-scale heterogeneous data while maintaining high accuracy. In the second stage, this indicator matrix is used directly as an input to the tri-factorization of the heterogeneous relation matrix, so the partition indicator matrix of the other entity type can be solved quickly. Experiments on benchmark data sets and real-world heterogeneous relational data sets show that FNMTF-CM is, overall, superior in accuracy and performance to traditional co-clustering algorithms based on nonnegative matrix factorization.
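
The two stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' FNMTF-CM implementation: it assumes the correlation matrix is formed as R Rᵀ from a relation matrix R whose rows are the smaller entity type, it uses a standard damped multiplicative update for symmetric NMF, and, with the row indicator F held fixed, it fits R ≈ F S Gᵀ by alternating a closed-form S with hard column assignments. All function names, parameters, and the toy matrix are hypothetical.

```python
import numpy as np


def symmetric_nmf(C, k, n_iter=500, beta=0.5, seed=0):
    """Approximate a symmetric nonnegative C by H @ H.T with H >= 0."""
    rng = np.random.default_rng(seed)
    H = rng.random((C.shape[0], k)) + 1e-3
    for _ in range(n_iter):
        # Damped multiplicative update; the small constant avoids division by zero.
        H *= (1.0 - beta) + beta * (C @ H) / (H @ (H.T @ H) + 1e-12)
    return H


def one_hot(labels, k):
    """Turn integer labels into a 0/1 partition indicator matrix."""
    M = np.zeros((labels.size, k))
    M[np.arange(labels.size), labels] = 1.0
    return M


def solve_other_side(R, F, k_cols, n_iter=10):
    """With the row indicator F fixed, fit R ~ F @ S @ G.T for S and a hard G."""
    sizes = np.maximum(F.sum(axis=0), 1.0)
    # P[a, j]: mean of column j of R over the rows in row cluster a.
    P = (F.T @ R) / sizes[:, None]
    # Farthest-first choice of k_cols starting profiles (deterministic).
    picks = [int(np.argmax(np.linalg.norm(P, axis=0)))]
    while len(picks) < k_cols:
        d = np.min([((P - P[:, [c]]) ** 2).sum(axis=0) for c in picks], axis=0)
        picks.append(int(np.argmax(d)))
    S = P[:, picks]
    for _ in range(n_iter):
        # ||R[:, j] - (F @ S)[:, l]||^2 equals this size-weighted distance
        # plus a term that does not depend on l.
        diff = P[:, :, None] - S[:, None, :]
        dist = (sizes[:, None, None] * diff ** 2).sum(axis=0)
        col_labels = dist.argmin(axis=1)
        G = one_hot(col_labels, k_cols)
        # Closed-form S for indicator F and G: block means of R.
        S = (P @ G) / np.maximum(G.sum(axis=0), 1.0)
    return col_labels, S


def two_stage_cocluster(R, k_rows, k_cols, seed=0):
    # Stage 1: small, dense correlation matrix of the smaller entity type
    # (R @ R.T is an assumed construction), then symmetric NMF.
    C = R @ R.T
    H = symmetric_nmf(C, k_rows, seed=seed)
    row_labels = H.argmax(axis=1)
    F = one_hot(row_labels, k_rows)
    # Stage 2: F is an input, so only S and the other indicator are solved.
    col_labels, S = solve_other_side(R, F, k_cols)
    return row_labels, col_labels, S


# Toy relation matrix: 6 entities of the smaller type, 8 of the larger type.
R = np.full((6, 8), 0.05)
R[:3, :4] = 1.0
R[3:, 4:] = 1.0
row_labels, col_labels, S = two_stage_cocluster(R, k_rows=2, k_cols=2)
```

In this sketch the only iterative factorization runs on the 6 × 6 matrix C rather than on R itself, which mirrors the abstract's argument that the correlation matrix is smaller and denser than the original relation matrix.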

Key words: heterogeneous network, co-clustering, nonnegative matrix factorization, large-scale data, correlation matrix

CLC Number: