权吸引子传播算法及其在文本聚类中的应用

管仁初; 裴志利; 时小虎; 杨晨; 梁艳春

权吸引子传播算法及其在文本聚类中的应用

Weight Affinity Propagation and Its Application to Text Clustering

摘要

摘要: 吸引子传播算法(affinity propagation, AP)是一种新的高效聚类算法.由于AP算法简单易用，它已被广泛应用到数据挖掘的各个领域.在AP算法中，相似性度量具有重要作用.另一方面，传统的文本挖掘常采用向量空间模型和满足欧氏空间的相似性度量.这种方法的优点是简单且易于实现，但随着数据规模的膨胀，向量空间将变得高维稀疏并将导致计算复杂度快速增长.为解决此问题，给出了相似特征集、排斥特征集和仲裁特征集的概念，在这些概念的基础上提出了一种能够包含文本结构信息的非欧空间相似性度量方法.并提出了一种新的聚类算法，称之为权吸引子传播算法(weight affinity propagation, WAP).为检验提出算法的聚类效果，选用标准数据集Reuters-21578进行了验证.实验结果表明WAP明显优于k-means聚类算法、具备非线性特征的SOFM聚类算法和采用经典相似性度量的吸引子传播算法等3种经典聚类算法.

Abstract: Affinity propagation (AP) is a newly developed and effective clustering algorithm. For its simplicity, general applicability, and good performance, AP has been used in many data mining research fields. In AP implementations, the similarity measurement plays an important role. Conventionally, text mining is based on the whole vector space model (VSM) and its similarity measurements often fall into Euclidean space. By clustering texts in this way, the advantage is simple and easy to perform. However, when the data scale puffs up, the vector space will become high-dimensional and sparse. Then, the computational complexity grows exponentially. To overcome this difficulty, a non-Euclidean space similarity measurement is proposed based on the definitions of similar feature set (SFS), rejective feature set (RFS) and arbitral feature set (AFS). The new similarity measurement not only breaks out the Euclidean space constraint, but also contains the structural information of documents. Therefore, a novel clustering algorithm, named weight affinity propagation (WAP), is developed by combining the new similarity measurement and AP. In addition, as a benchmark dataset, Reuters-21578 is used to test the proposed algorithm. Experimental results show that the proposed method is superior to the classical k-means, traditional SOFM and affinity propagation with classic similarity measurement.

HTML全文

参考文献(0)

施引文献

资源附件(0)