面向数据融合的半环溯源计算方法

薛见新; 申德荣; 寇月; 聂铁铮; 于戈

doi:10.7544/issn1000-1239.2016.20150872

面向数据融合的半环溯源计算方法

Semiring Provenance for Data Fusion

摘要

摘要: 数据融合是集成数据的质量保证和分析挖掘的前提条件；然而，数据融合作为一个整体对于用户来讲是一个黑盒过程，使得当前数据融合过程缺乏可解释性和可调试性.为了便于数据融合过程中有效的冲突检测和调试，需要利用数据溯源技术建立数据融合的可回溯机制.数据溯源描述了数据产生并随着时间推移而演变的整个过程，半环溯源模型作为一种经典的数据溯源表示形式,不仅能表示结果数据是由哪些数据派生的，而且还能够描述这些数据以什么方式进行派生.主要研究用于数据融合的半环溯源的计算问题.用于数据融合的半环溯源计算是一个pay as you go的模式，计算数据的溯源信息是一个非常耗时的过程.首先，提出一种基于Kleene序列的近似迭代方法，并证明了该方法与半环溯源的派生树定义的关系，从而证明了该方法的正确性.然后，提出了一种类牛顿序列，这种方法比Kleene序列有更好的收敛性.由于递归的引入可能会导致这2种迭代算法无法终止，通过分析结果元组的半环多项式溯源的特点，证明这2种近似算法最坏可在n次迭代后终止.最后，通过实验说明了本文提出的方法是可行和有效的.

Abstract: As an important part of the Web data integration, Web data fusion is the quality assurance of integrated data and the precondition of accurate analysis and mining. However, being a uniform data fusion is treated as black box, which makes the fusion lack of interpretability and debuggable ability. Therefore, to describe fusion process and origin for solving the conflict, we should construct a provenance mechanism with data provenance. Data provenance describes about how data is generated and evolves with time going on, which can not only show which input tuples contribute to the data but also how they contribute. We study the semiring provenance for data fusion. Firstly, we propose an approximate iterative approach to optimal the computational process of semiring provenance. After, to speed up the convergence, we show a Newton-like approach. Recursion may make the situation complicated, we analysize the characteristic of semiring provenance and show that Kleene sequence and Newton-like sequence can convergent only after n step. And experiments show that the technologies in this paper are highly effective and feasible.

HTML全文

参考文献(0)

施引文献

资源附件(0)