Abstract:
Deduplication improves storage utilization by eliminating redundant copies of data and replacing them with logical pointers to a single unique copy. It is now widely used in backup and archive systems. However, most existing deduplication systems compute hashes of data chunks and compare them to decide whether the chunks are redundant. Such hash-based exact matching is too strict for many applications, such as image deduplication. To solve this problem, we present a fast and accurate image deduplication approach. We first define duplicate images according to the characteristics of Web images, and then divide image deduplication into two stages: duplicate image detection and duplicate image deduplication. In the first stage, we use perceptual hashing to improve image retrieval speed and multiple filters to improve image retrieval accuracy. In the second stage, we use fuzzy logic reasoning to select proper centroid-images from the duplicate image sets by simulating the process of human thinking. Experimental results demonstrate that the proposed approach not only detects duplicate images quickly and accurately, but also meets users' perceptual requirements in the selection of centroid-images.
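
The contrast between exact hashing and perceptual hashing can be illustrated with a minimal sketch. It assumes a simple average hash, a common perceptual hash that is not necessarily the one used in this work, applied to a grayscale image that has already been downsampled to a small grid. The function names, the 4x4 sample images, and the threshold are hypothetical and chosen only for illustration.

```python
import hashlib


def exact_digest(pixels):
    """Cryptographic digest of the raw pixel bytes (exact-match hashing)."""
    return hashlib.sha256(bytes(p for row in pixels for p in row)).hexdigest()


def average_hash(pixels):
    """Average hash: one bit per pixel, 1 if the pixel is at or above the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return [1 if p >= mean else 0 for p in flat]


def hamming(a, b):
    """Number of bit positions in which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def is_near_duplicate(img_a, img_b, threshold=2):
    """Hypothetical rule: a small Hamming distance means near-duplicate."""
    return hamming(average_hash(img_a), average_hash(img_b)) <= threshold


# Hypothetical 4x4 grayscale image (values 0..255), assumed already downsampled.
original = [
    [10, 20, 200, 210],
    [15, 25, 205, 215],
    [12, 22, 202, 212],
    [18, 28, 208, 218],
]
# Same picture, slightly brightened: every byte changes.
brightened = [[min(255, p + 3) for p in row] for row in original]
# A visually different picture: each row mirrored left to right.
mirrored = [list(reversed(row)) for row in original]

if __name__ == "__main__":
    print(exact_digest(original) == exact_digest(brightened))
    print(is_near_duplicate(original, brightened))
    print(is_near_duplicate(original, mirrored))
```

In this sketch the brightened copy produces a different SHA-256 digest, so exact matching would treat it as a new image, while its average hash is unchanged. The mirrored image differs in every hash bit and is rejected.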