Abstract:
XML is the de facto standard for data exchange and data storage in network applications. The main problem in managing XML data is the redundancy caused by the intermingling of structure and data, which makes storing, exchanging, and processing XML data costly. Data compression techniques can reduce this redundancy. However, most existing XML compression methods reduce redundancy only within each individual XML document and ignore the redundancy among XML documents. This paper presents XCluster, a new XML compression method that exploits the similarity among XML documents. Queries can be evaluated directly on the compressed XML documents that XCluster generates. XCluster first clusters the input XML documents hierarchically, using an improved pq-gram approximate distance between root-ordered tag trees. It then compresses the structures in each cluster by merging them into a single representative structure. Finally, it places the data of nodes with the same tag into the same bucket and encodes each bucket with an algorithm suited to its data type. Extensive experiments on both real and synthetic datasets show that XCluster outperforms XGrind and XQilla in both compression ratio and query-processing efficiency.
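
The final step described above, grouping node data by tag and encoding each bucket according to its data type, can be illustrated with a minimal sketch. This is not XCluster's implementation: the function names (`bucket_by_tag`, `encode_bucket`, `decode_bucket`), the integer-versus-text rule, and the use of delta coding plus zlib are all assumptions chosen only to make the idea concrete.

```python
import zlib
import xml.etree.ElementTree as ET
from collections import defaultdict

# XML 1.0 text cannot contain NUL, so it is a safe separator between values.
SEP = "\x00"


def bucket_by_tag(xml_docs):
    """Collect the non-empty text of every element into a bucket keyed by tag."""
    buckets = defaultdict(list)
    for doc in xml_docs:
        root = ET.fromstring(doc)
        for elem in root.iter():  # document order, root included
            text = (elem.text or "").strip()
            if text:
                buckets[elem.tag].append(text)
    return dict(buckets)


def encode_bucket(values):
    """Encode one bucket with a type-dependent scheme (hypothetical rule).

    Buckets whose values are all canonical integers are delta-encoded and
    then compressed with zlib; every other bucket is compressed as text.
    """
    try:
        numbers = [int(v) for v in values]
        # Reject forms such as "007" or "+5" that would not round-trip.
        canonical = all(str(n) == v for n, v in zip(numbers, values))
    except ValueError:
        canonical = False

    if canonical:
        deltas = [numbers[0]] + [b - a for a, b in zip(numbers, numbers[1:])]
        payload = ",".join(str(d) for d in deltas).encode("ascii")
        return "int-delta", zlib.compress(payload)
    return "text", zlib.compress(SEP.join(values).encode("utf-8"))


def decode_bucket(kind, blob):
    """Invert encode_bucket and return the original list of strings."""
    raw = zlib.decompress(blob)
    if kind == "text":
        return raw.decode("utf-8").split(SEP)
    values, total = [], 0
    for delta in raw.decode("ascii").split(","):
        total += int(delta)
        values.append(str(total))
    return values
```

Grouping values by tag places similar data side by side, which is what lets a type-specific encoder work well; a real system would likely distinguish more data types than the two assumed here.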