Multi-Schema Integration Based on Usage and Clustering Approach

Ding Guohui (丁国辉), Wang Guoren (王国仁), and Zhao Yuhai (赵宇海)

Abstract: Data integration is an effective way to consolidate multiple data sources, and integrating the schemas of those sources accurately and efficiently is an important research problem. Although schema integration has been studied extensively, existing work neglects users' historical usage information, which is an important factor in improving the quality of schema integration. This paper proposes a novel clustering-based multi-schema integration method that exploits this usage information. First, a feature vector is extracted for each attribute of the source schemas from the database query log, and the attributes are clustered on these vectors. Second, based on the minimal difference among the resulting clusters, a maximal similarity threshold is introduced for each resulting cluster and used to detect the exceptional attributes in that cluster, that is, attributes whose semantics differ from those of the cluster. These fall into three kinds: departure core exceptional points, same source exceptional points, and excursion exceptional points. Finally, three elimination rules are designed, one for each kind of exceptional attribute, and on that basis an exceptional point elimination algorithm, EPKO, is proposed. Experimental results show that the proposed method achieves high accuracy and solves the multi-schema integration problem effectively.
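
The abstract does not define the feature vectors, the threshold, or the EPKO rules, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes that (1) each attribute's feature vector counts how often the attribute appears in each SQL clause type in the query log, (2) clusters have already been produced by some standard clustering algorithm, and (3) an attribute is flagged as exceptional when its mean cosine similarity to the other members of its cluster falls below a hand-picked threshold. The names `CLAUSES`, `feature_vectors`, and `flag_exceptions`, the attribute names, and the threshold value are all hypothetical.

```python
import math

# Assumed feature space: one dimension per SQL clause type (hypothetical choice).
CLAUSES = ("select", "where", "join", "order_by")


def feature_vectors(log):
    """Build per-attribute clause-usage counts from (attribute, clause) log entries."""
    vectors = {}
    for attr, clause in log:
        vec = vectors.setdefault(attr, [0] * len(CLAUSES))
        vec[CLAUSES.index(clause)] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two count vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_exceptions(cluster, vectors, threshold):
    """Return cluster members whose mean similarity to the other members is below threshold.

    This is a stand-in for the paper's threshold test, not the EPKO algorithm.
    """
    flagged = []
    for attr in cluster:
        others = [a for a in cluster if a != attr]
        if not others:
            continue  # a singleton cluster has nothing to compare against
        mean_sim = sum(cosine(vectors[attr], vectors[o]) for o in others) / len(others)
        if mean_sim < threshold:
            flagged.append(attr)
    return flagged


# Toy query log: two name-like attributes used mostly in SELECT,
# and one price-like attribute used mostly in ORDER BY.
log = (
    [("s1.name", "select")] * 8 + [("s1.name", "where")]
    + [("s2.title", "select")] * 9 + [("s2.title", "where")]
    + [("s3.price", "select")] + [("s3.price", "where")] * 2
    + [("s3.price", "order_by")] * 9
)
vectors = feature_vectors(log)
print(flag_exceptions(["s1.name", "s2.title", "s3.price"], vectors, threshold=0.5))
```

In this toy cluster, `s3.price` is used very differently from the two name-like attributes, so it is the only member whose mean similarity falls below the assumed threshold of 0.5.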
