ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development (计算机研究与发展), 2017, Vol. 54, Issue 3: 586-596. doi: 10.7544/issn1000-1239.2017.20151048

• Software Technology •



  • Published: 2017-03-01

Incremental and Interactive Data Integration Approach for Hierarchical Data in Domain of Intelligent Livelihood

Xia Ding1,2, Wang Yasha1,3, Zhao Zipeng1,2, Cui Da1,2   

  1. 1(Key Laboratory of High Confidence Software Technologies (Peking University), Ministry of Education, Beijing 100871); 2(School of Electronics Engineering and Computer Science, Peking University, Beijing 100871); 3(National Engineering & Research Center of Software Engineering (Peking University), Beijing 100871)
  • Online: 2017-03-01

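Abstract (translated from the Chinese): As a key domain of the smart city, intelligent livelihood covers many application systems that have accumulated large amounts of hierarchically structured data. To form a complete city-wide data set, the heterogeneous data schemas must be integrated and unified so that users are given a single view of the data. The domain has several distinctive characteristics: broad domain knowledge, a lack of support for semantic matching in Chinese, a large number of schemas, and element labels that are often missing even though instance data is plentiful. To address these, we propose an incremental and interactive schema integration approach. The approach completes the multi-schema integration task step by step in incremental iterations, which substantially reduces the computation required. In the schema matching stage, it uses both schema information and instance data to build several matchers that are suited to Chinese and have complementary strengths, measures the machine's decision confidence with similarity entropy, and introduces human intervention in moderation. In the mediated schema generation stage, it resolves the various conflicts that may arise between schemas and outputs a globally unified mediated schema. Experiments were designed and carried out on multi-source second-hand housing data crawled from the Internet. The results show that the approach achieves good schema matching accuracy while keeping human intervention sufficiently small.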


Abstract: Intelligent livelihood is an important domain of the smart city. In this domain, many application systems have accumulated a large amount of multi-source hierarchical data. To form a complete and unified view of this multi-source data across the whole city, the heterogeneous data schemas must be integrated. Existing schema integration approaches are not suitable because of the distinct characteristics of data in the intelligent livelihood domain, such as the lack of support for semantic matching of Chinese labels, the large number of schemas, and missing element labels. We therefore propose an incremental and iterative approach that reduces the heavy computational workload caused by the large number of schemas. In each iteration, both meta information and instance data are used to produce more precise matching results, and a criterion based on similarity entropy is introduced to control the amount of human intervention. Experiments were conducted on real second-hand housing data for Beijing collected from multiple second-hand housing Web applications. The results show that our approach achieves high matching accuracy with only a small amount of human intervention.

Key words: schema matching, schema integration, data integration, smart city, intelligent livelihood
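
The abstract states that similarity entropy is used to measure the machine's decision confidence and to decide when to involve a human, but it does not give the definition. The sketch below is one plausible reading, not the authors' formulation: it assumes the similarity scores of the candidate matches for one source element are normalized into a probability distribution, that their Shannon entropy serves as the uncertainty measure, and that a match is handed to a human when the entropy exceeds a threshold. The function names, the element labels, and the threshold value are all hypothetical.

```python
import math


def similarity_entropy(similarities):
    """Shannon entropy (base 2) of similarity scores normalized to sum to 1.

    Assumption: this normalization is an illustrative choice, not taken
    from the paper.
    """
    total = sum(similarities)
    if total <= 0:
        # No candidate has any support: treat as maximally uncertain.
        n = len(similarities)
        return math.log2(n) if n > 1 else 0.0
    probs = [s / total for s in similarities if s > 0]
    return -sum(p * math.log2(p) for p in probs)


def decide(candidates, threshold=1.0):
    """Pick the best candidate and say whether a human should confirm it.

    candidates: dict mapping a target element label to its similarity score.
    threshold:  hypothetical entropy cutoff; low entropy means one candidate
                clearly dominates, so the machine decides on its own.
    """
    if not candidates:
        raise ValueError("at least one candidate match is required")
    entropy = similarity_entropy(list(candidates.values()))
    best = max(candidates, key=candidates.get)
    action = "auto" if entropy <= threshold else "ask_human"
    return action, best, entropy


if __name__ == "__main__":
    # One candidate dominates: low entropy, accepted automatically.
    print(decide({"price": 0.90, "area": 0.05, "floor": 0.05}))
    # Candidates are close together: high entropy, deferred to a human.
    print(decide({"price": 0.40, "area": 0.35, "floor": 0.30}))
```

Under these assumptions, raising the threshold reduces the number of questions put to the human at the cost of accepting more uncertain matches automatically, which is the trade-off between intervention and accuracy that the abstract describes.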