Abstract:
Clustering of objects in Web documents has recently become a hot topic in the Web information retrieval (IR) research community. High-quality Web IR generally requires fine-grained clustering of the objects in documents. However, existing clustering algorithms are mostly confined to the level of sentence structure or textual topic. Because they do not use token information to identify finer-level objects, they often produce coarse-grained clustering results. To address this problem, the authors propose a novel fine-grained clustering algorithm named InfoSigs, which captures the token information signatures inside Web documents. The work makes two contributions. First, techniques are presented to construct a directed acyclic graph of information transmission from token frequency sequences, which imply a probabilistic hierarchy among tokens. Each token feature is given a weight based on the aggregated information distribution obtained from the signatures in the graph. Second, a self-tuning method is proposed for merging highly similar records, which effectively reduces the impact of noise. Experiments on real datasets show that the proposed InfoSigs algorithm outperforms conventional algorithms such as I-Match and Shingling, with an average improvement of 21.3% in F-measure. The results indicate that InfoSigs generates more fine-grained clustering results than the conventional methods.
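
For readers unfamiliar with the comparison baseline and the evaluation metric, the following is a minimal sketch of word-level shingling with Jaccard resemblance, together with the balanced F-measure. It is not the InfoSigs algorithm and is not taken from the paper. The function names, the whitespace tokenization, and the shingle width w=3 are assumptions made for illustration only.

```python
def shingles(text, w=3):
    """Return the set of w-token shingles (contiguous token windows) in text.

    Assumption: tokens are lowercase whitespace-separated words.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # A document shorter than w tokens yields a single short shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(doc_a, doc_b, w=3):
    """Jaccard resemblance between the shingle sets of two documents."""
    set_a, set_b = shingles(doc_a, w), shingles(doc_b, w)
    if not set_a and not set_b:
        return 1.0  # two empty documents are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Under this sketch, two records would be grouped together when their resemblance exceeds a chosen threshold. Because the comparison operates on fixed-width token windows rather than weighted individual tokens, it illustrates the kind of coarse matching that the abstract contrasts with token-level signatures.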