Abstract:
Data integration is an effective way to consolidate multiple data sources, and it depends on integrating the schemas of those sources accurately and efficiently. Although there is a large body of research on schema integration, existing work neglects the user's historical usage information, which is an important factor in improving the quality of schema integration. In this paper, a novel clustering-based multi-schema integration method is proposed that takes advantage of this usage information. First, a feature vector is extracted for each attribute of the source schemas from the query log of a database, and clustering is performed over these vectors. Second, based on the minimal difference among the resulting clusters, a maximal similarity threshold is introduced to detect, within each resulting cluster, all exceptional points whose semantics differ from the rest of the cluster. These exceptional points are of three kinds: departure core exceptional points, same source exceptional points, and excursion exceptional points. Finally, three exceptional point elimination rules are proposed, one for each kind of exceptional attribute within a resulting cluster, and on this basis a novel exceptional point elimination algorithm, named EPKO, is designed. Experimental results show that the proposed method solves the multi-schema integration problem accurately and efficiently.
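As an illustration of the first step only (feature extraction from a query log, followed by clustering), a minimal Python sketch is given below. The abstract does not specify how the feature vector is defined or which clustering algorithm is used, so everything here is an assumption made for illustration: the features are hypothetical per-clause occurrence counts, the similarity measure is cosine similarity, and the clustering is a simple greedy threshold scheme rather than the paper's method. The names `extract_features`, `cosine`, and `cluster_attributes` are likewise hypothetical, and the exceptional point detection and the EPKO algorithm are not covered.

```python
import math
import re

# Assumed feature dimensions: the SQL clause in which an attribute occurs.
CLAUSES = ("select", "where", "group by", "order by")
_CLAUSE_RE = re.compile(r"\b(select|where|group by|order by)\b", re.IGNORECASE)
_TOKEN_RE = re.compile(r"[A-Za-z_][\w.]*")


def extract_features(query_log, attributes):
    """Count, for each attribute, the queries in which it occurs per clause type."""
    features = {attr: [0] * len(CLAUSES) for attr in attributes}
    for query in query_log:
        # re.split with a capturing group yields [prefix, kw1, body1, kw2, body2, ...].
        parts = _CLAUSE_RE.split(query)
        for keyword, body in zip(parts[1::2], parts[2::2]):
            idx = CLAUSES.index(keyword.lower())
            tokens = set(_TOKEN_RE.findall(body))
            for attr in attributes:
                if attr in tokens:
                    features[attr][idx] += 1
    return features


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def cluster_attributes(features, threshold=0.8):
    """Greedy threshold clustering (an assumed stand-in, not the paper's algorithm).

    Each attribute joins the first cluster whose seed (first member) is at
    least `threshold` similar to it; otherwise it starts a new cluster.
    """
    clusters = []
    for attr, vec in features.items():
        for members in clusters:
            if cosine(features[members[0]], vec) >= threshold:
                members.append(attr)
                break
        else:
            clusters.append([attr])
    return clusters
```

Under these assumptions, attributes from different source schemas that are used in similar ways across the query log (for example, mostly projected in `SELECT`, or mostly filtered on in `WHERE`) end up in the same cluster and become candidates for integration.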