ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (5): 995-1004.doi: 10.7544/issn1000-1239.2015.20131548

    Next Articles

SIMD-Based Inverted Index Compression Algorithms

Yan Hongfei1, Zhang Xudong1, Shan Dongdong2, Mao Xianling3, Zhao Xin1   

  1. 1(Institute of Network Computing and Information Systems, Peking University, Beijing 100871); 2(Taobao (China) Software Co., Ltd, Hangzhou 312000); 3(Beijing Institute of Technology, Beijing 100081)
  • Online:2015-05-01

Abstract: The rapid growth of text information has brought about new challenges to traditional information retrieval. In large search engines, indexing is required to help users acquire important data they need, and techniques of inverted index have great influence on the efficiency of query processing in such systems. The data in inverted index is stored in the form of arrays of integers, and techniques of compression are required to reduce the cost of storing such data in disks and memory, as well as to boost the hit rate of CPU cache and speed up transferring data. Therefore, it is necessary to choose a highly efficient compression algorithm to process query effectively. In this paper, we propose two instruction-level-parallelized algorithms, i.e. SIMD-PB and SIMD-PFD, which improve two competitive compression algorithms respectively, i.e. PackedBinary and PForDelta, and exploit SIMD instructions to accelerate the Pack and Unpack procedure in the algorithms. Experiments based on public datasets of GOV2 and ClueWeb09B show that our novel algorithms have good performance on encoding and decoding speed without impairing the compression ratio, and outperform the former fastest inverted list compression algorithms by at most 17%, with respect to decompression speed. Furthermore, experiments indicate that our novel algorithms have better performance on longer posting list and larger block size w.r.t. decoding speed.

Key words: single instruction multiple data(SIMD), inverted index, compression, integer encoding, information retrieval

CLC Number: