ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2020, Vol. 57, Issue (9): 1939-1948. doi: 10.7544/issn1000-1239.2020.20190570

• Software Technology •

Discovering Consistency Constraints for Associated Data on Heterogeneous Schemas

Du Yuefeng, Li Xiaoguang, Song Baoyan

  1. (Information College, Liaoning University, Shenyang 110136) (duyuefeng@lnu.edu.cn)
  • Published: 2020-09-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (U1811261), the Project of Liaoning Provincial Public Opinion and Network Security Big Data System Engineering Laboratory, and the Natural Science Foundation of Liaoning Province.

Abstract: Data consistency is a central concern of data quality management. Because constraints express relationships among data abstractly and formally, they are an effective technique for managing data consistency. However, when data come from multiple sources, the sources follow different relational schemas, which makes it hard to fuse their consistency constraints. Moreover, data are interrelated whether they come from a single source or from several, and these relationships can strengthen the semantics that constraints express, which helps to uncover latent errors in the data. Specifically, conditional inclusion dependencies (CINDs) can be used for attribute matching across heterogeneous schemas, and content-related conditional functional dependencies (CCFDs) can be used to maintain the consistency of content-related data. On this basis, we study the problem of discovering consistency constraints for associated data on heterogeneous schemas. First, we analyze the fundamental problems of discovering CCFDs on heterogeneous schemas by using CINDs, and show that the satisfiability, implication, and validation problems are NP-complete, coNP-complete, and in PTIME, respectively. Second, to discover all CCFDs in the constraint space, we propose a 2-level lattice search structure that partitions the space by the conditional attributes and the variable attributes of CCFDs. Third, we design a discovery method based on CINDs and CCFDs that first fuses the constraint forms via CINDs and then finds consistency constraints incrementally. Finally, experiments on two real-life datasets verify the effectiveness and efficiency of the method.

Key words: heterogeneous schemas, associated data, conditional inclusion dependencies (CINDs), content-related conditional functional dependencies (CCFDs), constraint discovery
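
The polynomial-time validation mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not define CCFDs, so the sketch substitutes a plain conditional functional dependency with a single constant condition, and it uses a simplified single-pattern form of a CIND. All table names, attribute names, and values (`orders`, `customers`, `zip`, `postcode`, and so on) are hypothetical.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def matches(row: Row, condition: Dict[str, str]) -> bool:
    """True if the row carries every constant required by the condition."""
    return all(row.get(attr) == value for attr, value in condition.items())


def cind_violations(r1: List[Row], r2: List[Row], x: List[str], y: List[str],
                    cond1: Dict[str, str], cond2: Dict[str, str]) -> List[Row]:
    """Rows of r1 that match cond1 but whose x-values have no partner in r2.

    A partner is a row of r2 that matches cond2 and agrees on the y-values.
    """
    partners = {tuple(row[a] for a in y) for row in r2 if matches(row, cond2)}
    return [row for row in r1
            if matches(row, cond1) and tuple(row[a] for a in x) not in partners]


def cfd_violations(rows: List[Row], condition: Dict[str, str],
                   lhs: List[str], rhs: str) -> List[Tuple[Row, Row]]:
    """Conflicts among rows that match the condition and agree on lhs.

    Each conflicting row is reported once, paired with the first row
    seen for the same lhs values, so the check is a single pass.
    """
    first_seen: Dict[tuple, Row] = {}
    bad: List[Tuple[Row, Row]] = []
    for row in rows:
        if not matches(row, condition):
            continue
        key = tuple(row[a] for a in lhs)
        witness = first_seen.setdefault(key, row)
        if row[rhs] != witness[rhs]:
            bad.append((witness, row))
    return bad


# Hypothetical data from two sources with different schemas.
orders = [
    {"country": "CN", "zip": "110136", "city": "Shenyang"},
    {"country": "CN", "zip": "110136", "city": "Dalian"},  # conflicts with row above
    {"country": "UK", "zip": "EH4", "city": "Edinburgh"},
]
customers = [
    {"nation": "CN", "postcode": "110136", "town": "Shenyang"},
]

# CIND-style check: every CN order's zip must appear as a CN customer's postcode.
missing = cind_violations(orders, customers, ["zip"], ["postcode"],
                          {"country": "CN"}, {"nation": "CN"})

# CFD-style check: for CN orders, zip should determine city.
conflicts = cfd_violations(orders, {"country": "CN"}, ["zip"], "city")
```

Both checks make one pass over each table with hash lookups, which is consistent with validation being tractable. The inclusion check also shows how a CIND can line up `zip` in one schema with `postcode` in another before a functional-dependency check is applied.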
