Study of Aggregation Process Model and Algorithms of Autonomy Heterogeneous Data Sources

Wang Bo and Guo Bo

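Abstract

Summary: The central problem in sharing information among autonomous heterogeneous data sources is how to provide uniform access to the information held by autonomous data nodes in a P2P environment. Organizing data source nodes in a layered structure improves query efficiency and reduces computational overhead, but it requires nodes to form local clusters according to their mutual similarity. This paper gives a formal description of how data source nodes publish their information, proposes a multiple aggregation model for autonomous heterogeneous data sources based on schema element matching, together with the process for building the cluster organization, applies the TA algorithm to the top-K cluster node search problem, and on that basis proposes the TAL algorithm. Experimental results show that TA and TAL solve the cluster node ranking problem efficiently, and that TAL in particular outperforms TA in computation when the range of cluster nodes is large.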

Abstract: Data sharing is a pervasive challenge in applications that need to query across multiple autonomous data sources. Integration becomes more complicated when the data sources are distributed, heterogeneous, and numerous. One solution to the problems of distribution and scale is to perform data integration over P2P networks, but current P2P architectures are mostly flat: they specify mappings directly between peers and provide no schema abstraction. In this paper, a data sharing architecture similar to iXPeer is proposed to handle integration on several levels of schema abstraction. Peers are grouped into local clusters according to their similarity, so that highly similar peers fall into the same group, which improves query efficiency and reduces computing cost. An aggregation model based on element matching for autonomous data sources is proposed to construct the clusters. TA, originally proposed in the context of database middleware, is applied to generate a list of the best-ranked data source nodes. Because TA may require time exponential in the scale of the cluster organization, it is improved by adding labeling nodes, resulting in TAL, which generates the top-K cluster nodes. Experiments show that TA and TAL both perform well on top-K searching, and that TAL performs especially well when the number of clustered nodes is large.
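The top-K search step mentioned above can be illustrated with a minimal sketch. It assumes that TA refers to the threshold algorithm from the database middleware literature, that each similarity measure yields a list of (node, score) pairs sorted by descending score, that every node appears in every list, and that scores are combined by a monotone function (sum here). The function name, parameter names, and sample data are hypothetical; this is not the paper's implementation, and TAL is not shown because the abstract does not describe how labeling nodes are used.

```python
import heapq


def threshold_top_k(sorted_lists, k, aggregate=sum):
    """Return the k best (node, total_score) pairs, best first.

    Assumptions (not from the paper): every list holds (node, score)
    pairs sorted by descending score, all lists cover the same nodes,
    and `aggregate` is monotone (a higher input never lowers the total).
    """
    if k <= 0 or not sorted_lists:
        return []
    # Random access: lookup[i][node] is the node's score in list i.
    lookup = [dict(lst) for lst in sorted_lists]
    best = []    # min-heap of (total_score, node), at most k entries
    seen = set()
    for depth in range(len(sorted_lists[0])):
        last_scores = []
        # Sorted access: read one entry from every list at this depth.
        for lst in sorted_lists:
            node, score = lst[depth]
            last_scores.append(score)
            if node in seen:
                continue
            seen.add(node)
            total = aggregate(table[node] for table in lookup)
            if len(best) < k:
                heapq.heappush(best, (total, node))
            elif total > best[0][0]:
                heapq.heapreplace(best, (total, node))
        # No unseen node can score above the aggregate of the last-read scores.
        threshold = aggregate(last_scores)
        if len(best) == k and best[0][0] >= threshold:
            break
    ranked = sorted(best, key=lambda item: (-item[0], item[1]))
    return [(node, total) for total, node in ranked]


# Hypothetical similarity scores of four peers under two matching criteria.
by_schema = [("n1", 0.9), ("n2", 0.8), ("n3", 0.3), ("n4", 0.1)]
by_content = [("n2", 0.95), ("n3", 0.7), ("n1", 0.2), ("n4", 0.1)]
top_two = threshold_top_k([by_schema, by_content], k=2)
```

In this sample run the scan stops at the third depth without ever reading n4's entries, because by then the two best totals already meet the threshold. That early stop is the property that makes the approach attractive for ranking candidate cluster nodes.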
