ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (9): 1939-1948. doi: 10.7544/issn1000-1239.2020.20190570


Discovering Consistency Constraints for Associated Data on Heterogeneous Schemas

Du Yuefeng, Li Xiaoguang, Song Baoyan   

  1. (Information College, Liaoning University, Shenyang 110136)
  • Online: 2020-09-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (U1811261), the Project of Liaoning Provincial Public Opinion and Network Security Big Data System Engineering Laboratory, and the Natural Science Foundation of Liaoning Province.

Abstract: Data consistency is a central issue in data quality management. Because constraints can express relationships among data abstractly and formally, they are a technique for managing data consistency. However, the diversity of heterogeneous schemas across multiple sources poses great challenges to consistency management, especially for constraint fusion. Moreover, data are associated with one another, whether they come from a single source or from multiple sources. These associations can be used to strengthen the semantic expressiveness of constraints, which helps to detect potential data errors. In practice, CINDs (conditional inclusion dependencies) and CCFDs (content-related conditional functional dependencies) are effective techniques for, respectively, attribute matching under heterogeneous schemas and consistency maintenance on content-related data. On this basis, we study how to discover consistency constraints for associated data on heterogeneous schemas. We first investigate three fundamental problems related to CCFD discovery, and show that the implication, satisfiability, and validation problems are NP-complete, coNP-complete, and in PTIME, respectively. To search the entire space of CCFDs, we present a 2-level lattice built according to the division between the conditional attribute set and the variable attribute set of CCFDs. We then propose an incremental method for discovering fusion constraints over CINDs and CCFDs, which combines CCFDs on heterogeneous schemas via CINDs. Finally, experiments on two real-life data sets verify that our method is effective and scalable.
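
To make the validation problem concrete, the following minimal sketch checks a plain conditional functional dependency on a toy table. It is an illustration only, not the authors' algorithm: the abstract does not give the formal definition of CCFDs, so the sketch assumes the simpler form "for rows matching a constant condition, the variable attributes determine the dependent attribute". The function name `violates`, its parameters, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def violates(rows, condition, lhs, rhs):
    """Return pairs of row indices that violate the dependency.

    condition: dict mapping attribute -> constant a row must match
               (the conditional part; an assumption of this sketch)
    lhs:       tuple of variable attributes
    rhs:       the single dependent attribute
    """
    # Group the rows that satisfy the condition by their lhs values.
    groups = defaultdict(list)
    for i, row in enumerate(rows):
        if all(row.get(a) == v for a, v in condition.items()):
            groups[tuple(row[a] for a in lhs)].append(i)

    # Within each group, every row must agree with the first on rhs.
    bad = []
    for idxs in groups.values():
        first = idxs[0]
        for j in idxs[1:]:
            if rows[j][rhs] != rows[first][rhs]:
                bad.append((first, j))
    return bad


# Hypothetical sample data, invented for this example.
rows = [
    {"country": "CN", "zip": "110136", "city": "Shenyang"},
    {"country": "CN", "zip": "110136", "city": "Shenyang"},
    {"country": "CN", "zip": "110136", "city": "Dalian"},
    {"country": "US", "zip": "110136", "city": "Springfield"},
]

# Condition country = "CN"; within it, zip should determine city.
violations = violates(rows, {"country": "CN"}, ("zip",), "city")
print(violations)
```

The check makes one grouping pass over the rows and one pass over the groups, which is consistent with validation being tractable for a single fixed constraint. The hard part of discovery lies in the size of the space of candidate constraints.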

Key words: heterogeneous schemas, associated data, CINDs (conditional inclusion dependencies), CCFDs (content-related conditional functional dependencies), constraints discovery
