Abstract:
With the growing use and development of XML, XML structural clustering plays an important role in both the management and the mining of XML documents. Although many XML structural clustering algorithms have been proposed, in practice they are ineffective, inefficient, and sensitive to input order. In addition, they cannot support the incremental clustering that some application settings require. This paper addresses these problems by proposing a novel concept, the cluster-core, and points out that incremental clustering can be supported if the cluster-cores are maintained correctly in a dynamic environment. An effective XML structural clustering algorithm, COXClustering, is presented, which covers both static and incremental clustering. In static clustering, COXClustering extracts sub-trees to measure the similarity between XML structures, uses classification to improve clustering efficiency, and exploits the orthogonality of cluster-cores to reduce sensitivity to input order. In incremental clustering, it dynamically adjusts the cluster-cores based on newly added XML documents, and then adaptively guides incremental clustering through both instant adjustment and batch adjustment. Finally, comprehensive experiments on both synthetic and real datasets show that COXClustering improves clustering efficiency and quality and is insensitive to input order in static clustering. The experiments also show that incremental clustering substantially speeds up clustering, with quality close to that of static clustering.
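To make the idea of core-guided incremental assignment concrete, the following is a minimal sketch and not the COXClustering algorithm itself. It rests on several assumptions that do not come from the paper: root-to-node tag paths stand in for the extracted sub-trees, Jaccard overlap stands in for the similarity measure, a cluster's core is taken to be the set of paths shared by all its members, and the names `label_paths`, `jaccard`, `assign_incrementally`, and the `threshold` parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def label_paths(xml_text):
    """Return the set of root-to-node tag paths of an XML document.

    Assumption: these paths are a simple stand-in for structural sub-trees.
    """
    root = ET.fromstring(xml_text)
    paths = set()
    stack = [(root, root.tag)]
    while stack:
        node, path = stack.pop()
        paths.add(path)
        for child in node:
            stack.append((child, path + "/" + child.tag))
    return paths


def jaccard(a, b):
    """Jaccard overlap of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def assign_incrementally(docs, threshold=0.5):
    """Assign documents one at a time to the most similar core.

    Hypothetical scheme: a document joins the best-matching cluster if its
    similarity to that cluster's core reaches `threshold`; otherwise it
    starts a new cluster. On joining, the core is narrowed to the structure
    shared with the new member.
    """
    cores = []    # one set of shared paths per cluster
    members = []  # one list of document indices per cluster
    for i, text in enumerate(docs):
        feats = label_paths(text)
        best, best_sim = None, 0.0
        for k, core in enumerate(cores):
            sim = jaccard(feats, core)
            if sim > best_sim:
                best, best_sim = k, sim
        if best is not None and best_sim >= threshold:
            members[best].append(i)
            cores[best] &= feats
        else:
            cores.append(set(feats))
            members.append([i])
    return cores, members


docs = ["<a><b/><c/></a>", "<a><b/><c/><d/></a>", "<x><y/></x>"]
cores, members = assign_incrementally(docs)
```

In this toy run the first two documents share most of their paths and end up in one cluster, while the third starts its own. A greedy scheme like this one is itself order-dependent; it illustrates only the role of a maintained core in placing new documents, not the orthogonality property or the instant and batch adjustments described in the abstract.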