Abstract:
Repairing obsolete data by updating it to its latest values is a common challenge in improving data quality. Existing data repairing methods fall into two categories: those based on quality rules and those based on statistical techniques. Rule-based methods can express domain knowledge, but fall short in detecting and representing some complex relationships in data. Statistical methods can fix some errors that quality rules cannot detect or repair, but current methods must learn complex conditional probability distributions and cannot incorporate domain knowledge effectively. To overcome the shortcomings of both kinds of methods, this paper combines quality rules and statistical techniques to improve data currency. A new class of rules for repairing data currency is proposed. Domain knowledge is expressed directly by the antecedents and consequents of the rules, and statistical information is described by the distribution table associated with each rule. Based on these rules, algorithms for learning repairing rules and for fixing obsolete data are provided. Experiments on both real and synthetic data demonstrate the efficiency and effectiveness of the methods.