ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展) ›› 2014, Vol. 51 ›› Issue (10): 2239-2247. doi: 10.7544/issn1000-1239.2014.20130340

• Information Processing •

搜索引擎索引网页集合选取方法研究 (Research on Selecting the Index Page Collection for Search Engines)

茹立云1,2,3,李智超4,马少平1,2,3   

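  1. 1(State Key Laboratory of Intelligent Technology and Systems (Tsinghua University), Beijing 100084); 2(Tsinghua National Laboratory for Information Science and Technology (Preparatory), Beijing 100084); 3(Department of Computer Science and Technology, Tsinghua University, Beijing 100084); 4(Beijing Sogou Technology Development Co., Ltd., Beijing 100084) (lyru@vip.sohu.com)
  • Publication date: 2014-10-01
  • Funding: 
    National Natural Science Foundation of China (60673039, 60973068); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (2009004111002)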

Indexing Page Collection Selection Method for Search Engine

Ru Liyun1,2,3, Li Zhichao4, Ma Shaoping1,2,3   

  1. 1(State Key Laboratory of Intelligent Technology and Systems (Tsinghua University), Beijing 100084); 2(Tsinghua National Laboratory for Information Science and Technology, Beijing 100084); 3(Department of Computer Science and Technology, Tsinghua University, Beijing 100084); 4(Beijing Sogou Technology Development Co., Ltd., Beijing 100084)
  • Online: 2014-10-01

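Abstract (translated from the Chinese): With the rapid growth of the Internet, the number of Web pages is increasing explosively, and many of them are near-duplicate or low-quality pages. For a search engine, indexing such pages does not noticeably improve retrieval effectiveness; it only adds to the indexing and retrieval burden. This paper proposes a page selection algorithm for building a search engine's index page collection from massive Web page data. On the one hand, a clustering algorithm based on content signatures is used to deduplicate pages and reduce the size of the index collection; on the other hand, multiple features from the page dimension and the user dimension are combined to ensure the quality of the pages in the index collection. Experiments show that the index page collection obtained with this selection algorithm is only about 1/3 the size of the entire page collection while covering the vast majority of user clicks, which meets the needs of real users.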

Keywords (translated from the Chinese): search engine, content signature, text clustering, machine learning, linear regression model

Abstract: With the rapid development of the Internet, the number of Web pages is growing explosively, which poses a major challenge for search engines that provide Web page search services. Many of these pages have similar or even identical content, or are of low quality. For a search engine, indexing such pages does little to improve retrieval results, yet it increases the indexing and retrieval burden. This paper proposes a page selection algorithm for building the indexing page collection of a search engine from massive Web data. On the one hand, a signature-based clustering algorithm filters out similar pages to reduce the size of the indexing page collection; on the other hand, the algorithm combines a variety of page-dimension and user-dimension features to ensure the quality of the collection. The algorithm not only clusters and selects pages quickly, but also achieves a high compression ratio while preserving the information present in the indexing page collection. Experiments on real page collections show that the indexing page collection selected by the proposed algorithm is about 1/3 the size of the entire page collection and still covers the vast majority of user clicks, which makes it practical for real-world use.

Key words: search engine, content signature, text clustering, machine learning, linear regression model
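
The two-part selection idea summarized in the abstract (group pages by a content signature, then keep the best page of each group according to a linear score over page and user features) can be illustrated with a minimal Python sketch. The abstract does not specify the signature function, the features, or the model weights, so everything here is an assumption made for illustration: `content_signature` hashes the most frequent words, the feature names `inlinks`, `text_length`, and `clicks` are hypothetical, and `WEIGHTS` holds made-up values rather than fitted regression coefficients.

```python
import hashlib
import re
from collections import Counter, defaultdict


def content_signature(text, top_k=5):
    """Hypothetical signature: hash of the top_k most frequent words."""
    words = re.findall(r"\w+", text.lower())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(Counter(words).items(), key=lambda kv: (-kv[1], kv[0]))
    key = " ".join(word for word, _ in ranked[:top_k])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


# Made-up weights for a linear quality score (not fitted, not from the paper).
WEIGHTS = {"inlinks": 0.3, "text_length": 0.2, "clicks": 0.5}


def quality_score(features):
    """Weighted sum of page-side and user-side features (all hypothetical)."""
    return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)


def select_index_pages(pages):
    """Cluster pages by signature and keep the highest-scoring page per cluster."""
    clusters = defaultdict(list)
    for page in pages:
        clusters[content_signature(page["text"])].append(page)
    return [
        max(group, key=lambda p: quality_score(p["features"]))
        for group in clusters.values()
    ]


# Toy data: the first two pages share the same words, the third is unrelated.
PAGES = [
    {"url": "a.example/1", "text": "search engine index pages search engine",
     "features": {"inlinks": 0.5, "text_length": 0.4, "clicks": 0.9}},
    {"url": "b.example/1", "text": "Search engine index pages, search engine!",
     "features": {"inlinks": 0.5, "text_length": 0.4, "clicks": 0.1}},
    {"url": "c.example/1", "text": "weather forecast for beijing today",
     "features": {"inlinks": 0.2, "text_length": 0.3, "clicks": 0.4}},
]

selected = select_index_pages(PAGES)
print([page["url"] for page in selected])
```

In this toy run the two near-identical pages fall into one cluster and only the one with more clicks is kept, so three pages shrink to two. A real system would presumably learn the weights from labeled data and use a signature that tolerates small content differences; both are outside what the abstract states.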

CLC number (Chinese Library Classification):