Abstract:
With the arrival of the big data era, data analysis and processing have become increasingly important technologies on which data centers and Internet companies depend. As the volume of information grows and data structures diversify, mass data storage has become a hot topic in big data analysis. Traditional distributed file systems fall short of the new demands for scalability, reliability, and performance. In this paper, we design Clover, a cluster file system for big data analysis. Clover uses namespace management based on directory sharding and consistent hashing to solve the problem of metadata scaling. It provides metadata consistency for distributed transactions through a modified two-phase commit protocol. Moreover, Clover provides a high-availability mechanism based on a shared storage pool, achieving metadata reliability through hot standby and a global state recovery mechanism. The evaluation results show that Clover's metadata performance scales linearly, with average improvements ranging from 5.13% to 159.32% when one metadata server is added. Namespace management and distributed transactions do degrade performance on multiple metadata servers, but the impact is negligible (less than 10%). Compared with HDFS, Clover maintains similar throughput and recovers quickly from metadata server failures. Practical application tests show that Clover is suitable for building highly scalable and highly available storage systems.
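
To make the consistent-hashing idea mentioned in the abstract concrete, the following is a minimal Python sketch of how directories might be assigned to metadata servers. It is an illustration under stated assumptions, not Clover's actual implementation: the class name `HashRing`, the server names `mds0`, `mds1`, `mds2`, the use of MD5, and the choice of 64 virtual nodes per server are all hypothetical.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    """Map a string to a point on a 2**32 ring (MD5 is an illustrative choice)."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:4], "big")


class HashRing:
    """Hypothetical consistent-hash ring mapping directory paths to servers."""

    def __init__(self, servers=(), vnodes=64):
        self.vnodes = vnodes   # virtual nodes per server (assumed value)
        self._points = []      # sorted ring positions
        self._owner = {}       # ring position -> server name
        for server in servers:
            self.add_server(server)

    def add_server(self, server: str) -> None:
        """Place `vnodes` points for the server on the ring."""
        for i in range(self.vnodes):
            point = _hash(f"{server}#{i}")
            if point not in self._owner:  # skip the rare position collision
                bisect.insort(self._points, point)
                self._owner[point] = server

    def lookup(self, directory: str) -> str:
        """Return the server owning the first ring point after the directory's hash."""
        if not self._points:
            raise LookupError("ring has no servers")
        i = bisect.bisect_right(self._points, _hash(directory))
        return self._owner[self._points[i % len(self._points)]]


if __name__ == "__main__":
    ring = HashRing(["mds0", "mds1"])
    dirs = [f"/data/dir{i}" for i in range(200)]
    before = {d: ring.lookup(d) for d in dirs}
    ring.add_server("mds2")
    moved = sum(1 for d in dirs if ring.lookup(d) != before[d])
    print(f"{moved} of {len(dirs)} directories moved to the new server")
```

The property this sketch demonstrates is that adding a server relocates only the directories that fall into the new server's ring segments; every other directory keeps its previous owner, which is why consistent hashing suits incremental metadata scaling.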