ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2015, Vol. 52 ›› Issue (5): 995-1004.doi: 10.7544/issn1000-1239.2015.20131548

• 信息处理 •    下一篇

基于指令级并行的倒排索引压缩算法

闫宏飞1,旭东1,单栋栋2,毛先领3,赵鑫1   

  1. 1(北京大学网络与信息系统研究所 北京 100871); 2(淘宝(中国)软件有限公司 杭州 312000); 3(北京理工大学 北京 100081) (yhf@net.pku.edu.cn)
  • 出版日期: 2015-05-01
  • 基金资助: 
    基金项目:国家“九七三”重点基础研究发展计划基金项目(2014CB340400);国家自然科学基金项目(61272340);江苏未来网络创新研究院项目-云服务数字资源搜索(BY2013095-4-02)

SIMD-Based Inverted Index Compression Algorithms

Yan Hongfei1, Zhang Xudong1, Shan Dongdong2, Mao Xianling3, Zhao Xin1   

  1. 1(Institute of Network Computing and Information Systems, Peking University, Beijing 100871); 2(Taobao (China) Software Co., Ltd, Hangzhou 312000); 3(Beijing Institute of Technology, Beijing 100081)
  • Online: 2015-05-01

摘要: 文本信息数量的快速增长给传统的信息检索技术带来了新的挑战.搜索引擎通常使用倒排索引来高效地处理查询.为了减少存储开销和加快访问速度,倒排索引通常被压缩存储.因此,如何选择一个高性能的压缩算法对高效查询处理是非常有必要的.在已有倒排链压缩算法PackedBinary和PForDelta的基础上,利用CPU的超标量特性和SIMD向量指令集,将其压缩和解压缩中的关键步骤并行化,提出了2种指令级并行压缩算法SIMD-PB和SIMD-PFD.基于GOV2和ClueWeb09B两个公开数据集的实验表明,SIMD-PB和SIMD-PFD算法在压缩率不变的情况下,压缩和解压缩速度比现有的压缩算法均有非常明显的提升.其中解压缩速度比起目前最好的倒排链压缩算法,最高能提升17%.此外,实验表明算法在较长的倒排链、较大的压缩块单位上有更好的解压缩性能.

关键词: 单指令多数据流, 倒排索引, 压缩, 整数编码, 信息检索

Abstract: The rapid growth of text information has brought about new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire important data they need, and techniques of inverted index have great influence on the efficiency of query processing in such systems. The data in inverted index is stored in the form of arrays of integers, and techniques of compression are required to reduce the cost of storing such data in disks and memory, as well as to boost the hit rate of CPU cache and speed up transferring data. Therefore, it is necessary to choose a highly efficient compression algorithm to process query effectively. In this paper, we propose two instruction-level-parallelized algorithms, i.e. SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms respectively, i.e. PackedBinary and PForDelta, and exploit SIMD instructions to accelerate the Pack and Unpack procedure in the algorithms. Experiments based on public datasets of GOV2 and ClueWeb09B show that our novel algorithms have good performance on encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17%, with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms have better performance on longer posting list and larger block size w.r.t. decoding speed.

Key words: single instruction multiple data(SIMD), inverted index, compression, integer encoding, information retrieval

中图分类号: