ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2015, Vol. 52 ›› Issue (1): 130-140.doi: 10.7544/issn1000-1239.2015.20130691

• 软件技术 • 上一篇    下一篇

一种扩展条件函数依赖的发现算法

刘显敏,李建中   

  1. (哈尔滨工业大学 哈尔滨 150001) (xianmliu@gmail.com)
  • 出版日期: 2015-01-01
  • 基金资助: 
    基金项目:国家“九七三”重点基础研究发展计划基金项目(2012CB316200)|国家自然科学基金青年基金项目(61003046)

Discovering Extended Conditional Functional Dependencies

Liu Xianmin, Li Jianzhong   

  1. (Harbin Institute of Technology, Harbin 150001)
  • Online: 2015-01-01

摘要: 扩展条件函数依赖(extended conditional functional dependency, eCFD)是一种描述数据一致性的语义规则,是条件函数依赖(conditional functional dependency, CFD)的扩展.相比于CFD,eCFD能够描述更多的模式从而表达更丰富的语义信息.然而,关注eCFD的研究工作并不多.从给定数据中发现eCFD规则是一个重要问题,据笔者所知,目前还没有这方面的工作.该问题的难点在于,给定数据中所有合法的eCFD规则之间存在不一致的情况,且包含大量冗余,而CFD和传统的函数依赖规则并没有这样的问题.为避免不一致,同时尽可能地消除冗余,定义了“强合法eCFD”和“近似无冗余eCFD”.基于这些概念给出了eCFD发现问题的形式化定义,并给出了MeCFD算法.利用划分属性的方法,MeCFD首先生成所有的基本eCFD,然后,通过合并基本eCFD来构造“组合eCFD”.使用先深序来搜索候选空间,使得MeCFD仅用常数的存储空间来维护数据划分,节省了大量的空间开销,有效的剪枝策略被用来改进MeCFD的性能.真实数据集上的实验结果显示出MeCFD良好的可扩展性以及剪枝策略和优化方法的有效性.

关键词: 扩展条件函数依赖, 发现算法, 搜索算法, 剪枝策略, 冗余

Abstract: eCFD (extended conditional functional dependency) is proposed as the extension of CFD (conditional functional dependency) for data cleaning. Compared with CFD, eCFD can take more patterns of values and catch more semantic information. However, there are only few works about eCFD. This paper focuses on the problem of eCFD discovering, whose counterpart of CFD has been studied very much. As we know, this paper is the first work about eCFD discovering. To avoid inconsistencies and remove redundancies, based on the definitions of strongly validated and weakly non-redundant eCFDs, formal definition of eCFD discovering problem is given and MeCFD method is proposed to solve this problem. MeCFD first generates all basic eCFDs which are weakly non-redundant and semantically equivalent to all strongly validated eCFDs, then constructs compound eCFDs through merging basic eCFDs. Searching candidate space in depth-first order makes MeCFD use only constant memory space to maintain data partitions. Efficient pruning strategies are proposed to improve the performance of MeCFD. Theoretical analysis shows the correctness of MeCFD. Experiments over real data sets show the good scalability of MeCFD and the effectiveness of pruning strategies and optimizing methods.

Key words: extended conditional functional dependency (eCFD), discovering algorithm, search algorithm, pruning strategy, redundancy

中图分类号: