    Citation: Yi Gaoxiang (易高翔), Hu Heping (胡和平). A Web Search Result Clustering Based on Tolerance Rough Set[J]. Journal of Computer Research and Development, 2006, 43(2): 275-280.

    A Web Search Result Clustering Based on Tolerance Rough Set

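    • Abstract (translated from the Chinese): Some Web clustering methods treat clusters as strictly mutually exclusive, and their clustering results are unsatisfactory. A k-means clustering method based on tolerance rough sets addresses this problem. First, Web documents are represented with the vector space model, and the set of text feature terms is obtained by conventional methods. Then the co-occurrence of certain feature terms is used to construct a tolerance relation over the terms, which extends their descriptive power. Finally, the tolerance classes of the feature terms are used to describe the similarity between documents, which yields a clustering of Web search results. A simple and intuitive T model for measuring clustering precision is also proposed. Experimental results show that cluster labels obtained with the tolerance relation are descriptive and easy to understand, and that the method clearly outperforms the ordinary k-means algorithm.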

       

      Abstract: Most Web clustering algorithms treat clusters as mutually exclusive concepts, and few take the overlap between clusters into account, so the clustering results are not very good. In fact, a single page usually falls into several categories; that is, there exists an indiscernibility relation between clusters. Rough set theory, first presented by Pawlak in 1982, is a well-suited tool for expressing indiscernibility relations between sets. A k-means algorithm for clustering Web search results based on tolerance rough sets is proposed. First, Web documents are represented by terms in the vector space model. Then term co-occurrence is used to define the tolerance class of each term, which extends the ability of terms to describe documents. Finally, a Web search result clustering algorithm is implemented in which the similarity between documents is described by the term tolerance classes, and a simple and intuitive T criterion for estimating clustering precision is also presented. The proposed solution is evaluated on search results returned by actual Web search engines and compared with other recent methods. Using tolerance classes in Web result clustering yields comprehensible class labels and a good improvement in clustering quality.
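
      The abstract does not give the paper's exact definitions, so the following is only a minimal sketch of the general idea, under stated assumptions: two terms are taken to be tolerant when they co-occur in at least `theta` documents, a document is enriched with every term whose tolerance class intersects it (an upper approximation), and similarity is measured with a plain Jaccard index rather than the weighted vectors a k-means implementation would use. The function names, the threshold `theta`, and the toy documents are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to the terms it co-occurs with in >= theta documents.

    Assumption: co-occurrence is counted per document; every term is
    always a member of its own tolerance class.
    """
    cooc = defaultdict(int)
    vocab = set()
    for terms in docs:
        vocab |= terms
        # Sort so that each unordered pair is counted under a single key.
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    classes = {t: {t} for t in vocab}
    for (a, b), count in cooc.items():
        if count >= theta:
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(terms, classes):
    """All terms whose tolerance class shares at least one term with the document."""
    return {t for t, cls in classes.items() if cls & terms}


def jaccard(a, b):
    """Set overlap used here as a simple stand-in for document similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy search-result snippets, already reduced to term sets (hypothetical data).
docs = [
    {"jaguar", "car", "engine"},
    {"jaguar", "car", "dealer"},
    {"jaguar", "cat", "habitat"},
    {"car", "engine", "repair"},
]

classes = tolerance_classes(docs, theta=2)
raw_sim = jaccard(docs[2], docs[3])
enriched_sim = jaccard(
    upper_approximation(docs[2], classes),
    upper_approximation(docs[3], classes),
)
print(raw_sim, enriched_sim)
```

      In this toy data the third and fourth documents share no terms, yet their enriched representations overlap because "car" and "jaguar" are tolerant. That is the sense in which tolerance classes extend the descriptive power of terms. The clustering step itself (k-means over the enriched representations) and the T criterion are not shown.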

       
