    Yan Fang, Li Yuanzhang, Zhang Quanxin, Tan Yu’an. Object-Based Data De-Duplication Method for OpenXML Compound Files[J]. Journal of Computer Research and Development, 2015, 52(7): 1546-1557. DOI: 10.7544/issn1000-1239.2015.20140093

    Object-Based Data De-Duplication Method for OpenXML Compound Files

    • Content-defined chunking (CDC) is a prevalent data de-duplication algorithm for removing redundant data segments in storage systems. Current research on CDC does not consider the distinct content characteristics of different file types: chunk boundaries are determined in an essentially random way, and a single strategy is applied to all file types. Such a method has been shown to suit text and simple content, but it does not achieve optimal performance on compound files. A compound file is composed of unstructured data, usually occupies a large amount of storage space, and often contains multimedia data. Object-based data de-duplication is currently the most advanced method and an effective solution for detecting duplicate data in such files. We analyze the content characteristics of OpenXML files and develop an object extraction method. As part of this process, we propose an algorithm that determines the de-duplication granularity from the structure and distribution of the objects. The aim is to detect identical objects within a file or across different files, and to keep de-duplication effective when the physical layout of a compound file changes. In simulation experiments on a typical unstructured data collection, de-duplication efficiency improves by 10% compared with the CDC method on unstructured data in general.
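
The idea of detecting identical objects regardless of physical layout can be illustrated with a minimal sketch. It is not the authors' extraction or granularity algorithm: it only assumes that an OpenXML file is a ZIP package of named parts, treats each whole part as one object, and fingerprints the decompressed content with SHA-256. The function names `extract_objects` and `find_duplicate_objects` are hypothetical.

```python
import hashlib
import io
import zipfile


def extract_objects(package_bytes):
    """Return {part_name: sha256_hex} for every part in an OpenXML package.

    Assumption: each ZIP entry (part) is treated as one object. The paper's
    own granularity algorithm may split or group objects differently.
    """
    digests = {}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as package:
        for info in package.infolist():
            if info.is_dir():
                continue
            # Hash the decompressed content, so the fingerprint does not
            # depend on compression settings or on the part's position.
            content = package.read(info.filename)
            digests[info.filename] = hashlib.sha256(content).hexdigest()
    return digests


def find_duplicate_objects(packages):
    """Group parts by content digest across several packages.

    packages: {package_name: package_bytes}
    Returns {digest: [(package_name, part_name), ...]} restricted to
    digests that occur more than once.
    """
    seen = {}
    for package_name, package_bytes in packages.items():
        for part_name, digest in extract_objects(package_bytes).items():
            seen.setdefault(digest, []).append((package_name, part_name))
    return {d: locs for d, locs in seen.items() if len(locs) > 1}
```

Because the fingerprint is computed per object rather than per byte range of the container, an image embedded in two documents under different part names, or at different positions in the archive, still maps to the same digest.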