ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (1): 200-210. doi: 10.7544/issn1000-1239.2015.20130493

• Artificial Intelligence •

A Hierarchical Co-Clustering Algorithm for High-Order Heterogeneous Data

Yang Xinxin, Huang Shaobin

  1. (College of Computer Science and Technology, Harbin Engineering University, Harbin 150001) (yangxinxin051131@126.com)
  • Online: 2015-01-01
  • Supported by:
    the National Natural Science Foundation of China (71272216), the National Key Technology R&D Program of China (2012BAH08B02), and the Fundamental Research Funds for the Central Universities (HEUCF100603, HEUCFZ1212)



Abstract: High-order heterogeneous data, in which objects are represented by multiple features drawn from heterogeneous domains, are increasingly common in real-world applications. High-order co-clustering algorithms, which fuse information from multiple feature spaces to improve clustering quality, have therefore become an active research topic in recent years. Most existing high-order co-clustering algorithms, however, are non-hierarchical, whereas high-order heterogeneous data often hide hierarchical cluster structures. To mine these hidden patterns more effectively, we develop a high-order hierarchical co-clustering algorithm (HHCC). HHCC uses the Goodman-Kruskal τ, an index of association between categorical variables, to measure the association between objects and features: strongly associated objects are partitioned into the same object cluster, and simultaneously strongly associated features are partitioned into the same feature cluster. At each level, HHCC quantifies the quality of the object and feature clusterings with the Goodman-Kruskal τ; by optimizing τ with a local search procedure, it determines the number of clusters automatically and obtains the clustering result of that level. Following this top-down strategy, a tree-like cluster structure is formed. Experimental results demonstrate that HHCC outperforms four classical homogeneous hierarchical clustering algorithms and five previous high-order co-clustering algorithms.
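To illustrate the association measure the abstract builds on, here is a minimal sketch of the Goodman-Kruskal τ for a two-way contingency table. This follows the standard textbook definition (proportional reduction in prediction error); it is not the paper's full co-clustering objective, and the function name and table encoding are choices made for this example.

```python
def goodman_kruskal_tau(table):
    """Goodman-Kruskal tau: proportional reduction in the error of
    predicting the column variable once the row variable is known.

    table: a contingency table given as a list of rows of non-negative counts.
    Returns a value in [0, 1]; 1 means the row determines the column exactly,
    0 means the row carries no information about the column.
    """
    n = float(sum(sum(row) for row in table))
    col_totals = [sum(col) for col in zip(*table)]
    # Expected error when predicting columns from the marginals alone.
    base_err = 1.0 - sum((c / n) ** 2 for c in col_totals)
    # Expected error when the row category is known:
    # 1 - sum of cell^2 / (n * row_total) over cells of non-empty rows.
    cond_err = 1.0 - sum(
        cell * cell / (n * sum(row))
        for row in table if sum(row) > 0
        for cell in row
    )
    return (base_err - cond_err) / base_err

# A diagonal table: knowing the row determines the column, so tau = 1.
print(goodman_kruskal_tau([[5, 0], [0, 5]]))   # 1.0
# Independent rows and columns carry no information, so tau = 0.
print(goodman_kruskal_tau([[2, 2], [2, 2]]))   # 0.0
```

In a co-clustering setting of the kind the abstract describes, the table rows and columns would correspond to candidate object clusters and feature clusters, and a split that raises τ is preferred; the paper's actual local search over such splits is not reproduced here.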

Key words: high-order heterogeneous data, co-clustering, hierarchical clustering, measure of association, multiple feature spaces
