Research on XML Web Page Classification Based on Structure and Text Keyword Relevance

An Efficient XML Document Classification Method Based on Structure and Keyword Frequency

Abstract: To address the characteristics of XML Web pages, this paper presents a method for computing the structural similarity of XML documents, the positions at which keywords occur in a document, and the frequency of those keywords. Features of XML Web pages are extracted from the results of these computations, and a multi-class classification algorithm for XML Web pages based on support vector machines is designed. From the training sample set, the algorithm builds for each document class a classifier feature kernel (CFK), a cluster kernel of the similar features shared by that class. It then computes the similarity between each test document and every CFK to determine the class to which the document belongs. Experiments show that the algorithm achieves relatively high recall and precision and handles well the case in which an XML document belongs to several classes at once.
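The pipeline described in the abstract (extract structural and keyword features, build one kernel per class, compare a test document with every kernel) can be illustrated with a minimal sketch. This is not the authors' algorithm: cosine similarity against a summed per-class feature vector stands in for the SVM-based classifier and the CFK construction, keyword position weighting is omitted, and every name here (`extract_features`, `build_kernels`, `classify`, the `threshold` value) is a hypothetical choice made for illustration.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter


def extract_features(xml_text):
    """Count root-to-node tag paths (structure) and lowercased words (keywords)."""
    root = ET.fromstring(xml_text)
    features = Counter()

    def walk(node, path):
        current = path + "/" + node.tag
        features["path:" + current] += 1          # structural feature
        for word in (node.text or "").lower().split():
            features["word:" + word] += 1         # keyword frequency feature
        for child in node:
            walk(child, current)

    walk(root, "")
    return features


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_kernels(training_set):
    """Sum the feature counts of each class (assumed stand-in for a per-class kernel).

    training_set maps a class label to a list of XML strings.
    """
    kernels = {}
    for label, docs in training_set.items():
        total = Counter()
        for doc in docs:
            total.update(extract_features(doc))
        kernels[label] = total
    return kernels


def classify(xml_text, kernels, threshold=0.5):
    """Return every label whose kernel similarity reaches the (hypothetical) threshold."""
    feats = extract_features(xml_text)
    scores = {label: cosine(feats, kernel) for label, kernel in kernels.items()}
    labels = [label for label, score in scores.items() if score >= threshold]
    return labels, scores
```

Because `classify` returns every label above the threshold rather than only the best one, a document can be assigned to several classes, which mirrors the multi-class behaviour the abstract reports. The threshold of 0.5 is arbitrary and would need tuning on real data.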