概率数据库中近似函数依赖挖掘算法

苗东菁; 刘显敏; 李建中

doi:10.7544/issn1000-1239.2015.20140685

概率数据库中近似函数依赖挖掘算法

An Algorithm on Mining Approximate Functional Dependencies in Probabilistic Database

摘要

摘要: 一个近似函数依赖(approximate functional dependency, AFD)是一个几乎成立的函数依赖,目前大部分工作仅限于从一般数据上挖掘近似函数依赖.有时数据是被组织成概率数据的形式,为了从挖掘概率数据中挖掘出可用的近似函数依赖,定义了概率近似函数依赖,它不同于任何一种以往的定义,并给出了在不确定数据中,置信概率的动态规划求解算法,由于动态规划算法复杂度较高,导出了候选依赖的概率下界来进行剪枝,随后给出了基于字典序的挖掘方法以及相应的剪枝策略,最后,在真实和合成的数据集上进行充分的实验,说明了挖掘算法的可扩展性和剪枝策略的高效性,并展示了有趣的挖掘结果.

Abstract: An approximate functional dependency (AFD) is a functional dependency almost hold, and the most existing works are only able to mine AFDs from general data. Sometimes, data is stored in probabilistic database, in order to mine AFDs from such type of data, we define the probabilistic AFD, namely (λ, δ)-AFD which is different from the previous definition. We propose a dynamic programming to compute the confidence probability of a candidate AFD and check if the confidence probability is more than the probability threshold, however, as the high time complexity of dynamic programming, we derive the lower bound based on Chernoff bound to prune candidates as much as possible. Then, under help of the anti-monotone property, we propose a mining algorithm based on lexicographical order and some pruning criterions to speed up the mining process. At last, experiments are performed on the synthetic and the real-life data sets, and the results show the effectiveness of the pruning criterions and the scalability of our mining algorithm, and we show the interesting results mined from DBLP data set.

HTML全文

参考文献(0)

施引文献

资源附件(0)