ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2016, Vol. 53 ›› Issue (2): 316-325.doi: 10.7544/issn1000-1239.2016.20150872

所属专题: 2016数据融合与知识融合专题

• 软件技术 • 上一篇    下一篇

面向数据融合的半环溯源计算方法

薛见新,申德荣,寇月,聂铁铮,于戈   

  1. (东北大学信息科学与工程学院 沈阳 110819) (xuejianxin@research.neu.edu.cn)
  • 出版日期: 2016-02-01
  • 基金资助: 
    国家自然科学基金项目(61472070);国家“九七三”重点基础研究发展规划基金项目(2012CB316201)

Semiring Provenance for Data Fusion

Xue Jianxin, Shen Derong, Kou Yue, Nie Tiezheng, Yu Ge   

  1. (College of Information Science and Engineering, Northeastern University, Shenyang 110819)
  • Online: 2016-02-01

摘要: 数据融合是集成数据的质量保证和分析挖掘的前提条件;然而,数据融合作为一个整体对于用户来讲是一个黑盒过程,使得当前数据融合过程缺乏可解释性和可调试性.为了便于数据融合过程中有效的冲突检测和调试,需要利用数据溯源技术建立数据融合的可回溯机制.数据溯源描述了数据产生并随着时间推移而演变的整个过程,半环溯源模型作为一种经典的数据溯源表示形式,不仅能表示结果数据是由哪些数据派生的,而且还能够描述这些数据以什么方式进行派生.主要研究用于数据融合的半环溯源的计算问题.用于数据融合的半环溯源计算是一个pay as you go的模式,计算数据的溯源信息是一个非常耗时的过程.首先,提出一种基于Kleene序列的近似迭代方法,并证明了该方法与半环溯源的派生树定义的关系,从而证明了该方法的正确性.然后,提出了一种类牛顿序列,这种方法比Kleene序列有更好的收敛性.由于递归的引入可能会导致这2种迭代算法无法终止,通过分析结果元组的半环多项式溯源的特点,证明这2种近似算法最坏可在n次迭代后终止.最后,通过实验说明了本文提出的方法是可行和有效的.

关键词: 数据融合, 半环溯源, 多项式系统, 派生树, 递归查询

Abstract: As an important part of the Web data integration, Web data fusion is the quality assurance of integrated data and the precondition of accurate analysis and mining. However, being a uniform data fusion is treated as black box, which makes the fusion lack of interpretability and debuggable ability. Therefore, to describe fusion process and origin for solving the conflict, we should construct a provenance mechanism with data provenance. Data provenance describes about how data is generated and evolves with time going on, which can not only show which input tuples contribute to the data but also how they contribute. We study the semiring provenance for data fusion. Firstly, we propose an approximate iterative approach to optimal the computational process of semiring provenance. After, to speed up the convergence, we show a Newton-like approach. Recursion may make the situation complicated, we analysize the characteristic of semiring provenance and show that Kleene sequence and Newton-like sequence can convergent only after n step. And experiments show that the technologies in this paper are highly effective and feasible.

Key words: data fusion, semiring provenance, polynomial systems, derive trees, recursive queries

中图分类号: