Zhang Zilin, Liu Duo, Tan Yujuan, Wu Yu, Luo Longpan, Wang Weilüe, Qiao Lei. An Erasure-Coded Data Update Method for Distributed Storage Clusters[J]. Journal of Computer Research and Development, 2022, 59(11): 2451-2466. DOI: 10.7544/issn1000-1239.20210211
1(College of Computer Science, Chongqing University, Chongqing 400044)
2(Beijing Institute of Control Engineering, Beijing 100080)
Funds: This work was supported by the National Natural Science Foundation of China (62072059) and the Funds for Chongqing Distinguished Young Scholars (cstc2020jcyj-jqX0012).
Erasure coding is widely deployed in distributed storage clusters to provide data reliability, but disk I/O overhead becomes a performance bottleneck under update-intensive workloads. On the one hand, traditional update strategies must read the original data chunk before writing the new data, and under intensive updates this frequent write-after-read seriously degrades the write performance of the storage cluster. On the other hand, updating a parity chunk requires reading increments that are randomly distributed in the log file and merging them with the data file, which introduces additional disk seek overhead. This paper proposes a data update method named PARD (parity logging with reserved space and data delta) to address both problems. The main idea of PARD is to exploit the linearity of erasure coding to reduce write-after-read, and to exploit disk characteristics to reduce disk seek overhead. PARD has three key design features: 1) It adopts in-place updates for data chunks and log-based updates for parity chunks. 2) It uses the linearity of erasure coding to construct the log from data increments. For a series of write requests to the same data chunk, only the first update needs to read the original data chunk; subsequent updates are pure writes, which substantially reduces write-after-read. 3) In accordance with disk characteristics, it reserves space for the log at the end of the data file to reduce the disk seek overhead of reading and writing the log. Experiments show that, with a chunk size of 4 MB, PARD improves update throughput by at least 30.4%, 47.0%, and 82.0% compared with PLR, PARIX, and FO, respectively.
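The linearity argument behind design feature 2 can be illustrated with a small single-process sketch. It is not the PARD implementation: the names `gf_mul`, `scale`, `xor_bytes`, `encode_parity`, and `DeltaLog` are hypothetical, the field GF(2^8) with reduction polynomial 0x11D is an assumption, and the log is held in memory rather than in reserved space on disk. It only shows that a parity chunk can be brought up to date from the difference between the original and the latest content of a data chunk, so the original needs to be read once no matter how many times the chunk is overwritten.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8); the polynomial 0x11D is an assumption."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:          # reduce back into 8 bits
            a ^= 0x11D
        b >>= 1
    return result


def scale(coeff: int, chunk: bytes) -> bytes:
    """Multiply every byte of a chunk by one coding coefficient."""
    return bytes(gf_mul(coeff, x) for x in chunk)


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Addition (and subtraction) in GF(2^8) is byte-wise XOR."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_parity(coeffs, data_chunks) -> bytes:
    """Full re-encoding: parity = sum over i of coeffs[i] * data_chunks[i]."""
    parity = bytes(len(data_chunks[0]))
    for c, d in zip(coeffs, data_chunks):
        parity = xor_bytes(parity, scale(c, d))
    return parity


class DeltaLog:
    """Hypothetical in-memory log used only to illustrate the idea."""

    def __init__(self):
        self.original = {}     # chunk index -> content before the first update
        self.reads = 0         # number of write-after-read operations

    def update(self, data_chunks, idx, new_data):
        if idx not in self.original:
            # Only the first update of a chunk reads the original content.
            self.original[idx] = data_chunks[idx]
            self.reads += 1
        data_chunks[idx] = new_data   # in-place data update, a pure write

    def merge(self, coeffs, data_chunks, parity):
        # By linearity: new_parity = old_parity + coeff * (latest - original).
        for idx, orig in self.original.items():
            delta = xor_bytes(data_chunks[idx], orig)
            parity = xor_bytes(parity, scale(coeffs[idx], delta))
        self.original.clear()
        return parity
```

In this sketch, overwriting the same chunk several times and then calling `merge` yields the same parity as re-encoding all data chunks from scratch, while `reads` stays at one for that chunk.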