Algorithms for Solving the Data Currency Repair Problem

李默涵; 李建中

doi:10.7544/issn1000-1239.2015.20140687

Algorithms for Improving Data Currency

Abstract: Obsolete data are common in real applications, so repairing the obsolete values in a database to their latest values is a key step in improving data quality. Existing data repairing methods fall into two categories: those based on quality rules and those based on statistical techniques. Rule-based methods express domain knowledge intuitively as rules, but they have difficulty capturing some complex relationships in the data. Statistical methods can capture such relationships and can fix many errors that rules cannot detect or repair, but they need to learn complex conditional probability distributions and cannot readily incorporate domain knowledge about data semantics. To overcome the shortcomings of both, this paper studies the data currency repair problem and proposes a new class of repairing rules that combines the rule-based and statistical approaches. Domain knowledge is expressed directly by the antecedent and consequent of each rule, as in traditional rules, while the distribution table attached to each rule describes statistical information about how the data change over time. Based on these rules, algorithms for learning repairing rules and for fixing obsolete data are given. Experiments on both real and synthetic data demonstrate the efficiency and effectiveness of the methods.
