基于指令级并行的倒排索引压缩算法

闫宏飞; 旭东; 单栋栋; 毛先领; 赵鑫

doi:10.7544/issn1000-1239.2015.20131548

基于指令级并行的倒排索引压缩算法

SIMD-Based Inverted Index Compression Algorithms

摘要

摘要: 文本信息数量的快速增长给传统的信息检索技术带来了新的挑战.搜索引擎通常使用倒排索引来高效地处理查询.为了减少存储开销和加快访问速度，倒排索引通常被压缩存储.因此，如何选择一个高性能的压缩算法对高效查询处理是非常有必要的.在已有倒排链压缩算法PackedBinary和PForDelta的基础上，利用CPU的超标量特性和SIMD向量指令集，将其压缩和解压缩中的关键步骤并行化，提出了2种指令级并行压缩算法SIMD-PB和SIMD-PFD.基于GOV2和ClueWeb09B两个公开数据集的实验表明，SIMD-PB和SIMD-PFD算法在压缩率不变的情况下，压缩和解压缩速度比现有的压缩算法均有非常明显的提升.其中解压缩速度比起目前最好的倒排链压缩算法，最高能提升17%.此外，实验表明算法在较长的倒排链、较大的压缩块单位上有更好的解压缩性能.

Abstract: The rapid growth of text information has brought about new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire important data they need, and techniques of inverted index have great influence on the efficiency of query processing in such systems. The data in inverted index is stored in the form of arrays of integers, and techniques of compression are required to reduce the cost of storing such data in disks and memory, as well as to boost the hit rate of CPU cache and speed up transferring data. Therefore, it is necessary to choose a highly efficient compression algorithm to process query effectively. In this paper, we propose two instruction-level-parallelized algorithms, i.e. SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms respectively, i.e. PackedBinary and PForDelta, and exploit SIMD instructions to accelerate the Pack and Unpack procedure in the algorithms. Experiments based on public datasets of GOV2 and ClueWeb09B show that our novel algorithms have good performance on encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17%, with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms have better performance on longer posting list and larger block size w.r.t. decoding speed.

HTML全文

参考文献(0)

施引文献

资源附件(0)