ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2015, Vol. 52, Issue (9): 1992-2001. doi: 10.7544/issn1000-1239.2015.20140687

• Software Technology •

Algorithms for Solving the Data Currency Repair Problem

李默涵,李建中   

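  1. (School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001) (limohan.hit@gmail.com)
  • Published: 2015-09-01
  • Funding: 
    Supported by the National Basic Research Program of China (973 Program) (2012CB316200) and the Key Program of the National Natural Science Foundation of China (61133002)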

Algorithms for Improving Data Currency

Li Mohan, Li Jianzhong   

  1. (School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
  • Online: 2015-09-01

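Abstract: Obsolete data is widespread in real applications, so repairing obsolete values in a database to their latest values is a key step in improving data quality. Existing data repair methods fall into two main classes: rule-based and statistics-based. Rule-based methods can express domain knowledge intuitively in the form of rules, but have difficulty expressing certain complex correlations in the data. Statistics-based methods can express complex correlations in the data and can repair many errors that are hard to detect and repair with rules, but they all need to learn fairly complex conditional probability distributions, and they have difficulty directly applying domain knowledge about data semantics. This paper studies the problem of repairing data currency and, to overcome the shortcomings of the two existing classes of methods, proposes a new class of repair rules that combines rule-based and statistical methods to repair obsolete data. On the one hand, such a rule can express domain knowledge in the way traditional rules do; on the other hand, it can use its own distribution table to describe statistical information about how data changes over time. Algorithms for learning repair rules and for repairing data currency are then given. Experiments on both real and synthetic data verify the effectiveness of the algorithms.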

Keywords: data quality, data currency, data repair, data quality rules, distribution table

Abstract: Fixing obsolete data to its latest values is a common challenge in improving data quality. Previous data repairing methods fall into two categories: those based on quality rules and those based on statistical techniques. The former can express domain knowledge, but fall short in detecting and representing some complex relationships in the data. The latter can fix some errors that quality rules cannot detect or repair, but current methods need to learn complex conditional probability distributions and cannot incorporate domain knowledge effectively. To overcome the shortcomings of both kinds of methods, this paper combines quality rules and statistical techniques to improve data currency. A new class of rules for repairing data currency is proposed. Domain knowledge is expressed directly by the antecedents and consequents of the rules, and statistical information is described by the distribution table corresponding to each rule. Based on these rules, algorithms for learning repairing rules and fixing obsolete data are provided. Experiments on both real and synthetic data demonstrate the efficiency and effectiveness of the methods.

Key words: data quality, data currency, data repairing, data quality rules, distribution table
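
The abstract describes rules whose antecedents carry domain knowledge and whose distribution tables carry statistics about how values change over time. The sketch below is one possible reading of that idea and is not the formalism or the algorithms of the paper. The class `RepairRule`, its methods `learn` and `repair`, the `(old value, elapsed time, new value)` training triples, and the choice to repair to the most frequent observed value are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, Hashable, Iterable, Tuple

Record = Dict[str, Hashable]


@dataclass
class RepairRule:
    """Hypothetical repair rule: a predicate plus a distribution table."""
    antecedent: Callable[[Record], bool]  # domain knowledge as a predicate
    attribute: str                        # attribute to bring up to date
    # (stale value, elapsed time units) -> counts of observed latest values
    table: Dict[Tuple[Hashable, int], Counter] = field(
        default_factory=lambda: defaultdict(Counter))

    def learn(self, history: Iterable[Tuple[Hashable, int, Hashable]]) -> None:
        """Fill the table from (old value, elapsed, new value) observations."""
        for old, elapsed, new in history:
            self.table[(old, elapsed)][new] += 1

    def repair(self, record: Record, elapsed: int) -> Record:
        """Return a copy whose attribute holds the most frequent latest value."""
        fixed = dict(record)
        if not self.antecedent(record):
            return fixed  # rule does not apply to this record
        counts = self.table.get((record[self.attribute], elapsed))
        if not counts:
            return fixed  # no statistics for this case: leave value unchanged
        fixed[self.attribute] = counts.most_common(1)[0][0]
        return fixed


# Illustrative data only; the attribute names and values are made up.
rule = RepairRule(antecedent=lambda r: r.get("dept") == "engineering",
                  attribute="title")
rule.learn([
    ("junior engineer", 3, "engineer"),
    ("junior engineer", 3, "engineer"),
    ("junior engineer", 3, "junior engineer"),
])
stale = {"name": "A", "dept": "engineering", "title": "junior engineer"}
print(rule.repair(stale, elapsed=3)["title"])
```

In this sketch the antecedent decides whether the rule applies at all, and the table is consulted only afterwards, so a record outside the rule's scope or without matching statistics is returned unchanged.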

CLC number: