Abstract:
Data sharing is a pervasive challenge in applications that need to query across multiple autonomous data sources. Integration becomes more complicated when the data sources are distributed, heterogeneous, and numerous. One solution to the issues of distribution and scale is to perform data integration over P2P networks, but current P2P architectures are mostly flat: they specify mappings only directly between peers and provide no schema abstraction. In this paper, a data sharing architecture similar to iXPeer is proposed to handle integration on several levels of schema abstraction. Peers are grouped into local clusters according to their similarity, so that highly similar peers fall into the same cluster, which improves query efficiency and reduces computing cost. An aggregation model based on element matching for autonomous data sources is proposed to construct the clusters. The threshold algorithm (TA), originally proposed in the context of database middleware, is applied to generate a list of best-ranked data source nodes. Since TA may require time exponential in the scale of the cluster organization, it is improved by adding labeling nodes, resulting in TAL, which generates the top-K cluster nodes. Experiments show that both TA and TAL perform well on top-K search, and that TAL in particular performs well when the number of clustered nodes is large.
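As a rough illustration of the top-K step, the following is a minimal sketch of the classic threshold algorithm in Python. It is not the implementation evaluated in this paper: the function name `threshold_topk`, the peer identifiers, the two similarity lists, and the use of a plain sum as the aggregation function are all assumptions made for the example. The sketch also assumes non-negative scores and that every node appears in every list. The labeling nodes that distinguish TAL are not modeled here.

```python
import heapq


def threshold_topk(sorted_lists, k, agg=sum):
    """Return the k best (node, score) pairs under a monotone aggregation.

    sorted_lists: one list per similarity measure, each holding
    (node_id, score) pairs sorted by score in descending order.
    Assumption: every node appears in every list.
    """
    # Random-access index: per-list lookup from node to its score.
    lookup = [dict(lst) for lst in sorted_lists]
    seen = {}  # node_id -> aggregated score
    depth_limit = min(len(lst) for lst in sorted_lists)

    for depth in range(depth_limit):
        last_scores = []
        # Sorted access: read one entry from each list in parallel.
        for lst in sorted_lists:
            node, score = lst[depth]
            last_scores.append(score)
            if node not in seen:
                # Random access: fetch the node's score in every list.
                seen[node] = agg([lk[node] for lk in lookup])

        # No unseen node can score higher than this threshold.
        threshold = agg(last_scores)
        top = heapq.nlargest(k, seen.items(), key=lambda item: item[1])
        if len(top) == k and top[-1][1] >= threshold:
            return top  # early termination

    return heapq.nlargest(k, seen.items(), key=lambda item: item[1])


# Hypothetical similarity scores of four peers under two measures.
name_sim = [("p1", 0.875), ("p2", 0.75), ("p3", 0.25), ("p4", 0.125)]
type_sim = [("p2", 1.0), ("p3", 0.75), ("p1", 0.5), ("p4", 0.25)]

top = threshold_topk([name_sim, type_sim], k=2)
print(top)
```

With these sample lists the scan stops at the third row, once the second-best aggregated score is at least the threshold, so peer `p4` is never fetched. This early termination is what makes the approach attractive when the number of candidate nodes is large.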