Abstract:
Most Web clustering algorithms treat clusters as mutually exclusive classes; few take the overlap between clusters into account, which degrades clustering quality. In practice, a single page usually falls into several categories; that is, an indiscernibility relation exists between clusters. Rough set theory, first presented by Professor Pawlak in 1982, is a well-suited tool for describing indiscernibility relations between sets. This paper proposes a k-means algorithm for clustering Web search results based on tolerance rough sets. First, Web documents are represented as term vectors in the vector space model. Then, term co-occurrence values are used to define the tolerance class of each term, which extends the set of terms available to describe a document. Finally, a Web search result clustering algorithm is implemented in which the similarity between documents is described by term tolerance classes, and a simple, intuitive T criterion for estimating cluster precision is also presented. The proposed solution is evaluated on search results returned by actual Web search engines and compared with other recent methods. Using tolerance classes in Web search result clustering yields comprehensible class labels and a good improvement in clustering results.
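The tolerance-class step can be illustrated with a minimal sketch. It is not the paper's implementation, and it rests on several assumptions: the co-occurrence threshold `theta`, the helper names `tolerance_classes` and `upper_approximation`, and the toy documents are all hypothetical. Jaccard similarity is swapped in only to show the effect of the enrichment; the abstract does not specify the similarity measure or the T criterion, so neither is reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def tolerance_classes(docs, theta):
    """Map each term to the terms that co-occur with it in at least `theta` documents."""
    cooc = defaultdict(int)
    vocab = set()
    for doc in docs:
        terms = set(doc)
        vocab |= terms
        # Count each unordered term pair once per document.
        for a, b in combinations(sorted(terms), 2):
            cooc[(a, b)] += 1
    # Every term belongs to its own class (reflexivity).
    classes = {t: {t} for t in vocab}
    for (a, b), n in cooc.items():
        if n >= theta:
            # The relation is symmetric, so update both classes.
            classes[a].add(b)
            classes[b].add(a)
    return classes


def upper_approximation(doc, classes):
    """Return all terms whose tolerance class shares at least one term with the document."""
    terms = set(doc)
    return {t for t, cls in classes.items() if cls & terms}


def jaccard(a, b):
    """Set-overlap similarity, used here only as a stand-in measure."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy search results, already tokenized.
docs = [
    ["jaguar", "car", "speed"],
    ["jaguar", "car", "engine"],
    ["jaguar", "cat", "jungle"],
    ["car", "engine"],
]

classes = tolerance_classes(docs, theta=2)
raw_sim = jaccard(set(docs[0]), set(docs[3]))
enriched_sim = jaccard(
    upper_approximation(docs[0], classes),
    upper_approximation(docs[3], classes),
)
print(raw_sim, enriched_sim)
```

In this toy data, the first and fourth documents share only one literal term, but their enriched representations overlap more because "jaguar" and "engine" each fall in the tolerance class of "car". The enriched sets could then be turned into weighted vectors and passed to a k-means style procedure.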