Abstract:
Data warehouses that use column-oriented storage appear to be more conducive to data compression. Order-preserving lightweight compression methods have shown their superiority in compressing column-stored string data. However, they seldom consider the probability of string occurrence, which affects compression performance. This paper presents a probability-based order-preserving string compression method. First, we propose an improved shared leaf structure. It lets the encoding and decoding indexes share the same code table, which greatly reduces the time spent maintaining the two indexes. We also record the probability of each string in the proposed structure and build the decoding index according to these probabilities, which effectively reduces the decompression time of high-frequency strings. Furthermore, in keeping with the characteristics of column storage, we preserve row-id information in the proposed leaf structure, thus effectively reducing the storage space and creation time of the column index. Experimental results demonstrate the effectiveness of the proposed method.
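
To make the ideas above concrete, the following is a minimal sketch of an order-preserving dictionary for one string column. It is not the shared leaf structure proposed in this paper, whose layout is not described in the abstract. The class name `CodeTable`, its fields, and the flat sorted-list layout are all assumptions made for illustration. The sketch only shows three properties: a single code table serves both encoding and decoding, each entry records the occurrence probability of its string, and each entry keeps the row-ids at which the string occurs.

```python
from bisect import bisect_left
from collections import Counter


class CodeTable:
    """Hypothetical order-preserving code table for one string column.

    A string's code is its position in the sorted list of distinct values,
    so comparing two codes gives the same result as comparing the strings.
    """

    def __init__(self, column):
        counts = Counter(column)
        rows = {}
        for row_id, value in enumerate(column):
            rows.setdefault(value, []).append(row_id)

        total = len(column)
        # One shared table: values[code] decodes, binary search encodes.
        self.values = sorted(counts)
        # Occurrence probability of each distinct string, indexed by code.
        self.prob = [counts[v] / total for v in self.values]
        # Row-ids per code, so the table doubles as a simple column index.
        self.row_ids = [rows[v] for v in self.values]
        # Codes ordered from most to least frequent. How the paper builds
        # its decoding index from probabilities is not stated here; this
        # ordering is only an assumed stand-in.
        self.codes_by_prob = sorted(
            range(len(self.values)), key=lambda code: -self.prob[code]
        )

    def encode(self, value):
        """Return the order-preserving code of a string in the table."""
        code = bisect_left(self.values, value)
        if code == len(self.values) or self.values[code] != value:
            raise KeyError(value)
        return code

    def decode(self, code):
        """Return the string for a code, using the same shared table."""
        return self.values[code]


column = ["pear", "apple", "pear", "fig", "pear", "apple"]
table = CodeTable(column)
encoded = [table.encode(v) for v in column]
```

Because codes follow the sort order of the strings, a range predicate on the strings can be evaluated directly on the codes without decompressing the column.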