Classification in Networked Data Based on the Probability Generative Model

Wang Zhenwen, Xiao Weidong, and Tan Wentang

Abstract

Classification in networked data, which classifies entities using the relationships among them, is an important research topic in data mining. Existing methods generally assign a class to a node based on the classes of its neighboring nodes. These methods achieve high classification accuracy in networks with high homophily. However, many real-world networks have low homophily. In such networks, most connected nodes belong to different classes, so existing methods struggle to predict node classes correctly. This paper therefore proposes a new method for classification in networked data. The main idea is to build a probabilistic generative model of the network in which the edges are observed variables and the classes of the unlabeled nodes are latent variables. The model is solved by Gibbs sampling, which yields the values of the latent variables and hence the classes of the unlabeled nodes. Comparative experiments on real datasets show that the proposed method achieves better classification performance than existing methods on low-homophily networks.
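The abstract does not give the model's equations, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a simple block-style generative model in which an edge between two nodes appears with a probability that depends only on their classes, treats the adjacency matrix as observed and the unlabeled nodes' classes as latent, and resamples each latent class in turn by Gibbs sampling. The function name `gibbs_classify`, the `block_prob` matrix, and the toy network are all hypothetical choices made for illustration.

```python
import math
import random


def gibbs_classify(adj, labels, block_prob, n_sweeps=200, burn_in=50, seed=0):
    """Infer classes of unlabeled nodes by Gibbs sampling.

    adj        : symmetric 0/1 adjacency matrix (list of lists), observed edges
    labels     : class index for each labeled node, None for unlabeled nodes
    block_prob : block_prob[a][b] is the ASSUMED probability of an edge between
                 a class-a node and a class-b node; values must lie in (0, 1)
    """
    rng = random.Random(seed)
    n, k = len(adj), len(block_prob)
    unknown = [i for i, c in enumerate(labels) if c is None]
    # Start from a random assignment for the latent (unknown) classes.
    state = [c if c is not None else rng.randrange(k) for c in labels]
    counts = {i: [0] * k for i in unknown}

    for sweep in range(n_sweeps):
        for i in unknown:
            # Log-probability of node i's edges and non-edges under each class,
            # holding every other node's class fixed (uniform class prior).
            logp = []
            for c in range(k):
                total = 0.0
                for j in range(n):
                    if j == i:
                        continue
                    p = block_prob[c][state[j]]
                    total += math.log(p if adj[i][j] else 1.0 - p)
                logp.append(total)
            # Subtract the maximum before exponentiating for numerical stability.
            top = max(logp)
            weights = [math.exp(v - top) for v in logp]
            state[i] = rng.choices(range(k), weights=weights)[0]
            if sweep >= burn_in:
                counts[i][state[i]] += 1

    # Report the most frequently sampled class for each unlabeled node.
    result = list(labels)
    for i in unknown:
        result[i] = max(range(k), key=lambda c: counts[i][c])
    return result


# Toy low-homophily network: every edge joins nodes of different classes.
true_classes = [0, 0, 0, 0, 1, 1, 1, 1]
adj = [[int(a != b) for b in true_classes] for a in true_classes]
labels = [0, 0, None, None, 1, 1, None, None]
block_prob = [[0.05, 0.9], [0.9, 0.05]]  # cross-class edges are likely
predicted = gibbs_classify(adj, labels, block_prob)
print(predicted)
```

In this toy network a neighbor-voting classifier would label every unlabeled node with the wrong class, because all of its neighbors belong to the opposite class. The sampler instead scores each candidate class by how well it explains the node's observed edges and non-edges, which is why a generative view can cope with low homophily. Here `block_prob` is supplied by hand; in practice it would have to be estimated or given a prior, which this sketch leaves out.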
