ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2015, Vol. 52 ›› Issue (12): 2857-2865.doi: 10.7544/issn1000-1239.2015.20140685

• 软件技术 • 上一篇    下一篇

概率数据库中近似函数依赖挖掘算法

苗东菁,刘显敏,李建中   

  1. (哈尔滨工业大学计算机科学与技术学院 哈尔滨 150001) (miaodongjing@hit.edu.cn)
  • 出版日期: 2015-12-01
  • 基金资助: 
    国家“九七三”重点基础研究发展计划基金项目(2012CB316200,2012CB316202);国家自然科学基金项目(61402130)

An Algorithm on Mining Approximate Functional Dependencies in Probabilistic Database

Miao Dongjing, Liu Xianmin, Li Jianzhong   

  1. (School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
  • Online: 2015-12-01

摘要: 一个近似函数依赖(approximate functional dependency, AFD)是一个几乎成立的函数依赖,目前大部分工作仅限于从一般数据上挖掘近似函数依赖.有时数据是被组织成概率数据的形式,为了从挖掘概率数据中挖掘出可用的近似函数依赖,定义了概率近似函数依赖,它不同于任何一种以往的定义,并给出了在不确定数据中,置信概率的动态规划求解算法,由于动态规划算法复杂度较高,导出了候选依赖的概率下界来进行剪枝,随后给出了基于字典序的挖掘方法以及相应的剪枝策略,最后,在真实和合成的数据集上进行充分的实验,说明了挖掘算法的可扩展性和剪枝策略的高效性,并展示了有趣的挖掘结果.

关键词: 近似函数依赖, 数据挖掘, 概率数据库, 数据质量, 不一致性

Abstract: An approximate functional dependency (AFD) is a functional dependency almost hold, and the most existing works are only able to mine AFDs from general data. Sometimes, data is stored in probabilistic database, in order to mine AFDs from such type of data, we define the probabilistic AFD, namely (λ, δ)-AFD which is different from the previous definition. We propose a dynamic programming to compute the confidence probability of a candidate AFD and check if the confidence probability is more than the probability threshold, however, as the high time complexity of dynamic programming, we derive the lower bound based on Chernoff bound to prune candidates as much as possible. Then, under help of the anti-monotone property, we propose a mining algorithm based on lexicographical order and some pruning criterions to speed up the mining process. At last, experiments are performed on the synthetic and the real-life data sets, and the results show the effectiveness of the pruning criterions and the scalability of our mining algorithm, and we show the interesting results mined from DBLP data set.

Key words: approximate functional dependency (AFD), data mining, probabilistic database, data quality, inconsistency

中图分类号: