ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2021, Vol. 58 ›› Issue (4): 888-903.doi: 10.7544/issn1000-1239.2021.20190732

Previous Articles    

A Consistent Hash Data Placement Algorithm Based on Stripe

Wei Zheng1,2, Dou Yu1,2, Gao Yanzhen1,2, Ma Jie1, Sun Ninghui1, Xing Jing1   

  1. 1(State Key Laboratory of Computer Architecture (Institute of Computing Technology,Chinese Academy of Sciences),Beijing 100190);2(University of Chinese Academy of Sciences,Beijing 100190)
  • Online:2021-04-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFC0809300), the National Natural Science Foundation of China (61502454), and the Distributed Full Flash Project of ECR Team of Lenovo Research Institute.

Abstract: As the carrier of data storage, distributed storage system is widely used in the field of large data. Erasure codes are widely adopted by storage systems because of their high spatial efficiency and reliable data storage. In EB-level large-scale erasure coded distributed storage system, the cost of metadata management is high, and the query efficiency of metadata such as location information affects the I/O latency and throughput. The centralized data placement algorithm, based on location information records, needs frequent access to metadata servers, resulting in performance optimization constraints. More and more centerless data placement algorithms based on Hash mapping are applied. But some problems exist in the process of node change and data recovery, such as difficult location change, a large amount of migrated data, low concurrency of data recovery and migration. This paper proposes a consistent Hash data placement algorithm based on stripe (SCHash). SCHash places data in the unit of stripe. By transforming the mapping from data block to node into the mapping process from stripe to node group, it reduces the amount of data migration in the process of node change. Thus, in the recovery process, the proportion of data migration is reduced, and the recovery speed is accelerated. On the basis of SCHash, this paper designs and implements a recovery strategy of parallel I/O scheduling based on stripe. The recovery strategy avoids the selection of the data block in the same node in I/O operation, which also enhances the degree of parallelism of I/O. Compared with the APHash algorithm, SCHash algorithm reduces the data transfer by 46.71% to 85.28% in the data recovery. The recovery rate is improved by 48.16% when the nodes are rebuilt in the stripe, and the recovery rate is increased by 138.44% when the nodes are rebuilt out of the stripe.

Key words: distribute file system, erasure coding, consistent Hash, stripe, data placement, data recovery

CLC Number: