Research on Methods for Selecting the Indexing Page Collection of a Search Engine

茹立云; 李智超; 马少平

doi:10.7544/issn1000-1239.2014.20130340


Indexing Page Collection Selection Method for Search Engine

Abstract

With the rapid growth of the Internet, the number of Web pages is increasing explosively, which poses a major challenge for search engines that provide Web page search services. The Web also contains a large number of pages with similar or even identical content, as well as many low-quality pages. For a search engine, indexing such pages does little to improve retrieval results while adding to the indexing and retrieval burden. This paper proposes a page selection algorithm for building a search engine's indexing page collection from massive Web data. On the one hand, a clustering algorithm based on content signatures filters out pages with duplicate or similar content, reducing the size of the collection; on the other hand, multiple features from both the page dimension and the user dimension are combined to ensure the quality of the selected pages. The algorithm clusters and selects pages quickly and achieves a high compression ratio while preserving the amount of information in the indexing page collection. Experiments on real page collections show that the selected indexing page collection is only about 1/3 the size of the entire page collection yet covers the vast majority of user clicks, so it can meet the needs of real users.
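The two steps named in the abstract, filtering duplicates by content signature and then selecting pages by quality, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the signature function or the features used, so the MD5 hash over normalized text, the `page_score` and `click_count` fields, the `click_weight` parameter, and the `min_quality` threshold are all assumptions made for illustration. In particular, an exact hash only groups pages whose normalized text is identical, whereas the paper also targets similar content.

```python
import hashlib
import re
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    text: str
    page_score: float  # hypothetical page-dimension feature (e.g. a link-based score)
    click_count: int   # hypothetical user-dimension feature (e.g. observed clicks)


def content_signature(text: str) -> str:
    """Assumed signature: MD5 of the text after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def quality(page: Page, click_weight: float = 0.5) -> float:
    """Assumed quality score: a simple weighted sum of the two illustrative features."""
    return page.page_score + click_weight * page.click_count


def select_index_pages(pages, min_quality: float = 0.0):
    """Group pages by signature, then keep the best page of each group if it is good enough."""
    clusters = defaultdict(list)
    for page in pages:
        clusters[content_signature(page.text)].append(page)

    selected = []
    for members in clusters.values():
        best = max(members, key=quality)
        if quality(best) >= min_quality:
            selected.append(best)
    return selected
```

Calling `select_index_pages(pages, min_quality=0.5)` on a list of `Page` objects returns at most one representative per signature group and drops groups whose best page scores below the threshold. A production system would likely replace the exact hash with a near-duplicate signature and learn the feature weights from data.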
