Abstract:
Data deduplication is now applied to primary storage systems, such as those used in cloud computing. In these systems, read performance is a critical factor because of the strict demands on read response time. However, read performance has received little attention in data deduplication research. In this paper, we analyze the reading process and its bottleneck in deduplication systems and propose a reading model based on pipeline (RMBP). We further improve this model with parallel computation and evaluate, through theoretical analysis, its effect on reading speed. We then design a parallel, pipelined data deduplication system based on this reading model and run experiments on it with three different kinds of data. The experimental results show that the system using RMBP increases reading speed for all kinds of experimental data; that, for network security logs and virtual machine image data, it achieves a 5 times higher reading throughput; and that RMBP significantly improves reading performance across scenarios with different data deduplication ratios, and therefore has broad applicability.