InfoSigs: A Fine-Grained Clustering Algorithm for Web Objects

Sheng Zhenhua (盛振华), Wu Yu (吴羽), Jiang Jinhua (江锦华), Shou Lidan (寿黎但), and Chen Gang (陈刚)

Abstract: Fine-grained clustering of objects in Web documents has recently become a hot topic in the Web information retrieval (IR) research community, since high-quality, large-scale Web IR generally requires it. However, most existing clustering algorithms work only at the level of sentence structure, textual content, or article topic. Because they ignore token-level information that could identify finer-grained objects, their results are often coarse-grained. To address this problem, the authors propose a fine-grained clustering algorithm named InfoSigs, which uses the information distribution of tokens inside Web documents as feature signatures. The work makes two contributions. First, it presents techniques for constructing a directed acyclic graph of information transmission from token frequency sequences, which imply a tree-like probabilistic hierarchy among tokens. Each token feature is weighted according to how concentrated its information distribution is in the graph, which yields more representative feature vectors. Second, it proposes a self-tuning method for merging records; the method raises the similarity among records within a cluster and reduces the impact of noise on the merging process. Experiments on real datasets show that InfoSigs outperforms the conventional algorithms I-Match and Shingling by about 21.3% on average in terms of F-Measure. The results indicate that InfoSigs generates finer-grained clusters than the conventional methods and can be applied effectively to clustering Web objects across multiple domains.
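The abstract reports its gains in F-Measure but does not say which variant was used. As an illustration only, the sketch below computes a pairwise F-Measure, one common choice for evaluating clusterings: two records count as a predicted pair if they share a predicted cluster, and as a true pair if they share a ground-truth cluster. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def same_cluster_pairs(labels):
    """Return the set of index pairs (i, j), i < j, that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f_measure(predicted, truth):
    """Return pairwise (precision, recall, F1) of a clustering against ground truth.

    Assumption: this is the pairwise variant of F-Measure; the paper may
    use a different definition.
    """
    if len(predicted) != len(truth):
        raise ValueError("label lists must have the same length")
    pred_pairs = same_cluster_pairs(predicted)
    true_pairs = same_cluster_pairs(truth)
    hits = len(pred_pairs & true_pairs)
    precision = hits / len(pred_pairs) if pred_pairs else 0.0
    recall = hits / len(true_pairs) if true_pairs else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical toy data: five records; cluster ids are arbitrary labels.
truth = ["a", "a", "b", "b", "b"]
predicted = [0, 0, 0, 1, 1]
p, r, f = pairwise_f_measure(predicted, truth)
print(p, r, f)
```

A coarse-grained clustering that merges distinct objects into one cluster adds many false pairs and lowers precision, which is the failure mode the abstract attributes to topic-level methods.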
