Probability-Based Order-Preserving String Compression in Column-Oriented Data Warehouse

Xia Xiaoling, Li Haiyan, and Wang Mei

Abstract: Storing data by column in a data warehouse makes it more amenable to compression, and order-preserving lightweight compression methods are well suited to string attributes stored by column. However, existing approaches seldom account for how the probability of string occurrence affects compression efficiency, which limits their performance. This paper therefore presents a probability-based order-preserving string compression method. First, we propose an extended shared-leaf structure in which the encoding index and the decoding index share a single code table, which greatly reduces the time needed to maintain both indexes. The structure also records the occurrence probability of each string, and the decoding index is built according to these probabilities, which effectively reduces the decompression time of high-frequency strings. Furthermore, exploiting the characteristics of column storage, the row-ID information used for column joins is kept in the extended leaf structure, which effectively reduces the storage space and creation time of the column value index. Experimental results demonstrate the effectiveness of the proposed method.
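The ideas summarized in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the class and field names (`SharedLeafDictionary`, `Leaf`, `CODE_GAP`), the gapped integer codes, and the frequency-ordered probe list used for decoding are all assumptions chosen for brevity, the last being a deliberately simple stand-in for the probability-based decoding index. The sketch shows one leaf table shared by the encoding and decoding paths, codes whose order matches string order, and row IDs stored alongside each leaf.

```python
from bisect import bisect_left
from collections import Counter
from dataclasses import dataclass, field

CODE_GAP = 10  # assumed spacing between codes, leaving room for later inserts


@dataclass
class Leaf:
    """One entry of the shared code table (hypothetical layout)."""
    value: str                                   # original string
    code: int                                    # order-preserving integer code
    freq: float                                  # relative frequency in the column
    row_ids: list = field(default_factory=list)  # rows that hold this value


class SharedLeafDictionary:
    def __init__(self, column):
        counts = Counter(column)
        # Distinct values in sorted order receive increasing codes,
        # so comparing two codes gives the same result as comparing the strings.
        self.leaves = [
            Leaf(value, rank * CODE_GAP, counts[value] / len(column))
            for rank, value in enumerate(sorted(counts))
        ]
        # Encoding index: the sorted keys, searched by bisection.
        self._sorted_values = [leaf.value for leaf in self.leaves]
        # Row IDs live in the leaves, so no separate value-to-row index is built.
        for row, value in enumerate(column):
            self._find(value).row_ids.append(row)
        # Decoding index: the same leaf objects, ordered by descending frequency.
        self.decode_order = sorted(self.leaves, key=lambda leaf: -leaf.freq)

    def _find(self, value):
        i = bisect_left(self._sorted_values, value)
        if i == len(self.leaves) or self.leaves[i].value != value:
            raise KeyError(value)
        return self.leaves[i]

    def encode(self, value):
        return self._find(value).code

    def decode(self, code):
        """Return (string, number of probes); frequent strings need fewer probes."""
        for probes, leaf in enumerate(self.decode_order, start=1):
            if leaf.code == code:
                return leaf.value, probes
        raise KeyError(code)

    def rows_of(self, value):
        return self._find(value).row_ids


column = ["pear", "apple", "pear", "fig", "pear", "apple"]
d = SharedLeafDictionary(column)
compressed = [d.encode(v) for v in column]
print(compressed, d.decode(d.encode("pear")), d.rows_of("apple"))
```

Because both indexes reference the same `Leaf` objects, a change to a leaf is made once rather than in two separate tables, which is the maintenance saving the abstract attributes to the shared-leaf design. A real decoding index would replace the linear probe with a structure that still favors high-probability strings.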
