ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2016, Vol. 53 ›› Issue (2): 479-491.doi: 10.7544/issn1000-1239.2016.20148274

• 软件技术 • 上一篇    下一篇



  1. (哈尔滨工业大学计算机科学与技术学院 哈尔滨 150001) (
  • 出版日期: 2016-02-01
  • 基金资助: 

Reachability Query in Heterogeneous Information Networks

Yin Dan, Gao Hong, Zou Zhaonian, Li Jianzhong   

  1. (School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
  • Online: 2016-02-01

摘要: 随着图数据规模的爆炸式增长,其形式也越来越复杂.异构信息网可建模成包含多种类型的顶点和多种类型的边的图.例如,文献数据库、在线购物网站等.首次研究异构信息网上的可达性查询问题.利用不同类型顶点之间的关系,查询2个顶点满足路径模式的可达性,该问题的时间复杂度是多项式的.然而在大规模的网络上,每次查询遍历一遍网络的时间开销也是不能容忍的.现有的可达性查询问题主要分为2类: k跳可达性查询和带有标签约束的可达性查询.但是这2种问题的算法都不能用于解决异构信息网上的可达性查询问题.因此,为了实现高效的在线查询,提出一种新的索引结构,通过路径模式的分解,预先计算部分路径模式的可达信息.当在线查询到来时,在路径模式的偏序图上,快速找到索引结构中存在的路径子模式,高效地计算查询结果.在真实和人工数据集上进行了大量实验,验证了算法的有效性.

关键词: 异构信息网, 查询处理, 可达性, 路径模式, 索引

Abstract: With the size of graph data increasing explosively, the form of graph data is much more complicated. Heterogeneous information networks can be modeled as graphs, which contain multiple types of nodes and multiple types of edges, e.g., bibliographic database, online shopping website and knowledge graphs. Reachability query in heterogeneous information networks is investigated in this paper. By using different types of relationships between nodes, the reachability of nodes is queried while satisfying specific path schema. This problem is polynomial. However, the time costing can’t be tolerant by scanning the big network for answering one query. The existing reachability work can be classified into two categories: K-hop reachability query and label-constraint reachability query. But these techniques can’t be used for processing path-based reachability query in heterogeneous information networks. Therefore, in order to respond online queries efficiently, a novel index structure is proposed, which decomposes path schemas and precomputes the reachability of nodes in sub-path schemas. Online query is computed efficiently by decomposing the query path schema and using the reachability of the indexes. A path schema decomposition strategy is developed by searching the partial order graph of path schemas in order to minimize the query time. Experiments on real world and synthetic data demonstrate the effectiveness of algorithms for reachability query in heterogeneous information networks.

Key words: heterogeneous information network, query processing, reachability, path schema, index