Functional Dependencies Discovering in Distributed Big Data

Li Weibang; Li Zhanhuai; Chen Qun; Jiang Tao; Liu Hailong; Pan Wei

doi:10.7544/issn1000-1239.2015.20140229

Abstract

Discovering functional dependencies (FDs) in relational databases is an important database analysis technique with a wide range of applications in knowledge discovery, database semantics analysis, data quality assessment, and database design. Existing FD discovery algorithms mainly target centralized data and are usually suitable only for small datasets. Discovering FDs in a distributed environment is far more challenging, especially with big data. In this paper, we propose an FD discovery algorithm for big data in distributed environments. First, each node runs FD discovery in parallel on its local data, and the results are used to prune the candidate FD set. Second, the remaining candidates are grouped according to the features of their left-hand sides (LHS), and a distributed discovery algorithm is executed in parallel for each group, which eventually yields all FDs. We analyze the number of candidate FDs that can be checked under different groupings, and we take both data shipment and load balancing into account during execution. Experiments on real-world big datasets show that the proposed approach is clearly more efficient than existing methods.
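The two-phase idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes the relation is horizontally partitioned across nodes, it simulates nodes as in-memory lists, and the names `holds`, `candidates`, `discover`, and `max_lhs` are hypothetical. It relies on one fact: an FD violated on any local fragment is also violated globally, so local results can safely prune candidates. The sketch returns every FD that holds, without reducing the result to minimal FDs.

```python
from collections import defaultdict
from itertools import combinations


def holds(rows, lhs, rhs):
    """Return True if the FD lhs -> rhs holds on the given rows."""
    seen = {}
    for row in rows:
        key = tuple(row[a] for a in lhs)
        # First value seen for this LHS key must match all later ones.
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def candidates(attrs, max_lhs=2):
    """Enumerate candidate FDs (lhs, rhs) with at most max_lhs LHS attributes."""
    for size in range(1, max_lhs + 1):
        for lhs in combinations(attrs, size):
            for rhs in attrs:
                if rhs not in lhs:
                    yield lhs, rhs


def discover(fragments, attrs, max_lhs=2):
    """Two-phase FD discovery over horizontally partitioned fragments."""
    # Phase 1: each node checks candidates on its local data (run in
    # parallel in a real system); a local violation prunes the candidate.
    survivors = [
        (lhs, rhs)
        for lhs, rhs in candidates(attrs, max_lhs)
        if all(holds(frag, lhs, rhs) for frag in fragments)
    ]

    # Phase 2: group the surviving candidates by their left-hand side.
    groups = defaultdict(list)
    for lhs, rhs in survivors:
        groups[lhs].append(rhs)

    # Merging all fragments stands in for shipping tuples with equal LHS
    # values to the same node (assumption: no real network shuffle here).
    merged = [row for frag in fragments for row in frag]
    result = []
    for lhs, rhs_list in groups.items():
        result.extend((lhs, rhs) for rhs in rhs_list if holds(merged, lhs, rhs))
    return result
```

Grouping by LHS means a real system could repartition the data once per group and check every right-hand side in that group together, which is where data shipment and load balancing come into play.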
