ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2020, Vol. 57, Issue (2): 269-280. doi: 10.7544/issn1000-1239.2020.20190543

Special Issue: 2020 Special Topic on Frontier Technologies of Big Data and Intelligent Storage Systems


Fingerprint Search Optimization for Deduplication on Emerging Storage Devices

He Kewen, Zhang Jiachen, Liu Xiaoguang, and Wang Gang   

  1. (College of Computer Science, Nankai University, Tianjin 300350) (Tianjin Key Laboratory of Network and Data Security Technology (Nankai University), Tianjin 300350)
  • Online: 2020-02-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (U1833114, 61872201, 61702521, 61602266), the Natural Science Foundation of Tianjin (17JCYBJC15300, 16JCYBJC41900), the Artificial Intelligence Major Project of Tianjin (18ZXZNGX00140, 18ZXZNGX00200), and the Fundamental Research Funds for the Central Universities.

Abstract: Fingerprint search is the I/O-intensive stage of a data deduplication system, and the performance of the external storage device is its bottleneck. This paper therefore focuses on fingerprint search in deduplication systems. It compares the traditional eager deduplication algorithm with lazy deduplication algorithms, which reduce the number of disk accesses, studies both on emerging storage devices (Optane SSD and persistent memory), and gives optimization suggestions. We model the fingerprint search latency of the eager and lazy deduplication algorithms, and the model yields three conclusions for the new storage devices: 1) the number of fingerprints per batched search should be reduced; 2) the locality ring size should be reduced on faster devices, and the locality ring size has an optimal value; 3) on fast devices, eager fingerprint lookup outperforms lazy fingerprint lookup. Finally, experiments on a real HDD and Optane SSD and on emulated persistent memory verify the model. The eager algorithm outperforms the lazy algorithm on the emerging storage devices, and the optimal locality ring size moves to a smaller value, which is broadly consistent with the conclusions of the proposed model.
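
The contrast between eager and lazy fingerprint lookup described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm or latency model: the bucketed index, the names `ToyIndex`, `eager_dedup`, `lazy_dedup`, `NUM_BUCKETS`, and the `batch_size` parameter are all assumptions made for illustration, and locality rings are not modeled. The sketch only counts simulated device accesses: the eager variant reads the index once per fingerprint, while the lazy variant buffers fingerprints per bucket and resolves a whole batch with one read.

```python
import hashlib
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical number of buckets in the on-device index


def fingerprint(chunk: bytes) -> str:
    """Content fingerprint of a chunk (SHA-1 is used here only as an example)."""
    return hashlib.sha1(chunk).hexdigest()


def bucket_of(fp: str) -> int:
    """Map a fingerprint to an index bucket."""
    return int(fp, 16) % NUM_BUCKETS


class ToyIndex:
    """Stand-in for an on-device fingerprint index; each bucket read is one access."""

    def __init__(self):
        self.buckets = defaultdict(set)
        self.accesses = 0

    def read_bucket(self, b: int) -> set:
        self.accesses += 1
        return self.buckets[b]


def eager_dedup(chunks, index):
    """Eager lookup: search the index immediately, one bucket read per chunk."""
    duplicates = 0
    for chunk in chunks:
        fp = fingerprint(chunk)
        bucket = index.read_bucket(bucket_of(fp))
        if fp in bucket:
            duplicates += 1
        else:
            bucket.add(fp)
    return duplicates


def lazy_dedup(chunks, index, batch_size):
    """Lazy lookup: buffer fingerprints per bucket, search a full batch with one read."""
    pending = defaultdict(list)
    duplicates = 0

    def flush(b):
        nonlocal duplicates
        bucket = index.read_bucket(b)  # one access serves the whole batch
        for fp in pending.pop(b):
            if fp in bucket:
                duplicates += 1
            else:
                bucket.add(fp)

    for chunk in chunks:
        fp = fingerprint(chunk)
        b = bucket_of(fp)
        pending[b].append(fp)
        if len(pending[b]) >= batch_size:
            flush(b)
    for b in list(pending):  # drain the remaining partial batches
        flush(b)
    return duplicates


if __name__ == "__main__":
    data = [b"a", b"b", b"a", b"c", b"b", b"a"]
    eager_index, lazy_index = ToyIndex(), ToyIndex()
    print("eager:", eager_dedup(data, eager_index), "duplicates,", eager_index.accesses, "accesses")
    print("lazy: ", lazy_dedup(data, lazy_index, batch_size=3), "duplicates,", lazy_index.accesses, "accesses")
```

Both variants find the same duplicates, and the lazy variant never issues more bucket reads than the eager one. The sketch counts accesses only and assigns them no latency, so it cannot by itself reproduce the paper's conclusion that eager lookup wins on fast devices; that conclusion depends on the latency model and measurements reported in the paper.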

Key words: deduplication, persistent memory, fingerprint index, emerging storage device, data spatial locality
