Object-Based Data De-Duplication Method for OpenXML Compound Files

Yan Fang; Li Yuanzhang; Zhang Quanxin; Tan Yu'an

doi:10.7544/issn1000-1239.2015.20140093


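¹(School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100086)
²(School of Information, Beijing Wuzi University, Beijing 101149) (yfjoy@163.com)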

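Funding: National High Technology Research and Development Program of China (863 Program) (2013AA01A212); National Natural Science Foundation of China (61370063); Beijing Higher Education Young Elite Teacher Project (YETP1532, YETP1178)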

Details

CLC number: TP311
Publication history
- Published: 2015-06-30

Object-Based Data De-Duplication Method for OpenXML Compound Files

¹(School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100086)
²(School of Information, Beijing Wuzi University, Beijing 101149)

Abstract

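Abstract: Most existing data de-duplication techniques are based on the content defined chunking (CDC) algorithm and do not take the content characteristics of different file types into account. This approach determines chunk boundaries in a random manner and applies the same strategy to all file types. It has been shown to work well for text and simple content, but not for compound files composed of unstructured data. We analyze the properties of compound files that follow the OpenXML standard, give a basic method for object extraction, and propose an algorithm that determines the de-duplication granularity from object distribution and object structure. The goal, for compound files composed of unstructured data, is to detect identical objects both across different files and at different positions within the same file, and to de-duplicate effectively even when the physical layout of a file changes. Simulation experiments on a typical unstructured data set show that, overall, object-based de-duplication improves the de-duplication ratio for unstructured data by about 10% compared with the CDC method.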
- content defined chunking (CDC) /
- object /
- unstructured data /
- OpenXML standard /
- compound file /
- data de-duplication
Abstract: Content defined chunking (CDC) is a prevalent data de-duplication algorithm for removing redundant data segments in storage systems. Current research on CDC does not consider the distinct content characteristics of different file types: chunk boundaries are determined in a random way and a single strategy is applied to all file types. Such a method has been shown to suit text and simple content, but it does not achieve optimal performance for compound files. A compound file is composed of unstructured data, usually occupies a large amount of storage space, and often contains multimedia data. Object-based data de-duplication is currently the most advanced method and an effective solution for detecting duplicate data in such files. We analyze the content characteristics of OpenXML files and develop an object extraction method. As part of this process, we propose an algorithm that determines the de-duplication granularity from the structure and distribution of objects. The purpose is to detect identical objects within a file or across different files, and to de-duplicate compound files effectively even when their physical layout changes. In simulation experiments on a typical unstructured data collection, the de-duplication ratio for unstructured data improves by about 10% overall compared with the CDC method.
- content defined chunking (CDC) /
- object /
- unstructured data /
- OpenXML standard /
- compound file /
- data de-duplication
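
The idea of detecting identical objects regardless of where they sit in a file can be illustrated with a minimal Python sketch. It rests on assumptions that are not taken from the paper: each part of the ZIP container that an OpenXML package uses is treated as one object, and objects are compared by SHA-256 digest. The granularity algorithm based on object structure and distribution is not reproduced here, and the names `make_package` and `deduplicate` are hypothetical.

```python
import hashlib
import io
import zipfile


def make_package(parts):
    """Build an in-memory ZIP package from {part_name: bytes} (demo helper)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        for name, content in parts.items():
            package.writestr(name, content)
    return buffer.getvalue()


def deduplicate(packages):
    """Index every part of every package by the hash of its content.

    packages: {file_name: package_bytes}
    Returns (locations, total_bytes, unique_bytes), where locations maps a
    digest to the list of (file_name, part_name) pairs that share it.
    Assumption: one ZIP part is one "object".
    """
    stored_size = {}  # digest -> size of the single stored copy
    locations = {}    # digest -> every place this object occurs
    total_bytes = 0
    for file_name, data in packages.items():
        with zipfile.ZipFile(io.BytesIO(data)) as package:
            for info in package.infolist():
                if info.is_dir():
                    continue
                content = package.read(info)  # uncompressed part content
                digest = hashlib.sha256(content).hexdigest()
                total_bytes += len(content)
                stored_size.setdefault(digest, len(content))
                locations.setdefault(digest, []).append((file_name, info.filename))
    return locations, total_bytes, sum(stored_size.values())


# The same image is embedded under different names and at different
# positions in two packages.
image = bytes(range(256)) * 64
packages = {
    "a.docx": make_package({
        "word/document.xml": b"<doc>report A</doc>",
        "word/media/image1.png": image,
    }),
    "b.pptx": make_package({
        "ppt/media/image7.png": image,
        "ppt/slides/slide1.xml": b"<slide>deck B</slide>",
    }),
}
locations, total_bytes, unique_bytes = deduplicate(packages)
shared = [refs for refs in locations.values() if len(refs) > 1]
print(shared, total_bytes - unique_bytes)
```

Because the comparison is made on the uncompressed content of each part, the shared image is found even though its name, its position in the archive, and the surrounding data differ between the two files. That is the property the abstract contrasts with chunk boundaries computed over the raw byte stream.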
