ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 (Journal of Computer Research and Development) ›› 2017, Vol. 54 ›› Issue (11): 2576-2585. doi: 10.7544/issn1000-1239.2017.20160578

• Information Processing •

A Semantic Similarity Calculation Method Based on a Large-Scale Knowledge Repository

Zhang Libo1, Sun Yihan2, Luo Tiejian2

  1. Institute of Software, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 101408 (zsmj@hotmail.com)
  • Published: 2017-11-01
  • Supported by: 
    the System Optimization Foundation of the Chinese Academy of Sciences (Y42901VED2, Y42901VEB1, Y42901VEB2)

Calculate Semantic Similarity Based on Large Scale Knowledge Repository

Zhang Libo1, Sun Yihan2, Luo Tiejian2   

  1. Institute of Software, Chinese Academy of Sciences, Beijing 100190; 2. University of Chinese Academy of Sciences, Beijing 101408
  • Online: 2017-11-01

Abstract: The total amount of human knowledge keeps growing, and semantic analysis based on the structured big data produced by humans has important applications in fields such as recommender systems and information retrieval. The first problem to solve in these fields is computing semantic similarity. Previous studies achieved certain breakthroughs by exploiting large-scale knowledge repositories, represented by Wikipedia, but the paths in these repositories were not fully utilized. This work evaluates the similarity of words and texts with a bidirectional shortest-path algorithm modeled on the way humans think, so as to make full use of the path information in the knowledge repository. The proposed algorithm extracts from Wikipedia the hyperlink relations among nodes whose granularity is finer than that of articles, verifies for the first time the universal connectivity of Wikipedia, and estimates the average shortest-path length between two articles. Finally, experiments on public datasets show that the algorithm clearly outperforms existing algorithms on word similarity and approaches the state of the art on text similarity.

Keywords: large-scale knowledge repository, semantic similarity, Wikipedia, shortest distance, connectivity

Abstract: With the continuous growth of the total amount of human knowledge, semantic analysis based on the structured big data generated by humans is becoming more and more important in application fields such as recommender systems and information retrieval. Calculating semantic similarity is a key problem in these fields. Previous studies achieved certain breakthroughs by applying large-scale knowledge repositories, represented by Wikipedia, but the paths in Wikipedia were not fully utilized. In this paper, we summarize and analyze previous algorithms for evaluating semantic similarity based on Wikipedia. On this foundation, we propose a bilateral shortest-path algorithm that evaluates the similarity between words and between texts in a way modeled on how human beings think, so that it takes full advantage of the path information in the knowledge repository. We extract from Wikipedia the hyperlink structure among nodes whose granularity is finer than that of articles, then verify the universal connectivity of Wikipedia and estimate the average shortest-path length between any two articles. The presented algorithm is evaluated on public word-similarity and text-similarity datasets; the results show that it clearly outperforms existing algorithms on word similarity and approaches the state of the art on text similarity. Finally, we sum up the advantages and disadvantages of the proposed algorithm and suggest directions for follow-up study.
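The core idea of a bilateral (bidirectional) shortest-path search over a hyperlink graph can be sketched as follows. This is only an illustrative sketch: the paper's Wikipedia-scale node extraction and its exact similarity formula are not given in the abstract, so the toy graph, the function names, and the 1/(1+d) score mapping are assumptions made here for demonstration.

```python
import math

def _expand(graph, frontier, dist, other, best):
    """Advance one full BFS level from `frontier`; whenever a newly
    discovered node is already known to the opposite search, record the
    combined path length as a candidate in `best`."""
    nxt = []
    for node in frontier:
        for nbr in graph.get(node, ()):
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                nxt.append(nbr)
                if nbr in other:  # the two searches met at nbr
                    best = min(best, dist[nbr] + other[nbr])
    return nxt, best

def bidirectional_shortest_path(graph, source, target):
    """Exact shortest-path length between two nodes of an unweighted
    hyperlink graph, searching breadth-first from both ends at once.
    Returns math.inf if the nodes are not connected."""
    if source == target:
        return 0
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = [source], [target]
    best = math.inf
    while frontier_s and frontier_t:
        # Expand the smaller frontier to keep the two searches balanced.
        if len(frontier_s) <= len(frontier_t):
            frontier_s, best = _expand(graph, frontier_s, dist_s, dist_t, best)
        else:
            frontier_t, best = _expand(graph, frontier_t, dist_t, dist_s, best)
        if best < math.inf:  # a full level ended with a meeting: optimal
            return best
    return math.inf

def similarity(graph, a, b):
    """Map a path length to a (0, 1] score: the shorter the path, the
    higher the similarity.  This particular mapping is illustrative."""
    d = bidirectional_shortest_path(graph, a, b)
    return 0.0 if math.isinf(d) else 1.0 / (1.0 + d)

# A small undirected "hyperlink" graph standing in for Wikipedia.
toy_graph = {
    "A": ["B", "E"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E", "F"], "E": ["A", "D"], "F": ["D"], "G": [],
}
```

Because each side only expands its frontier one level at a time, the search explores roughly two balls of half the path's radius instead of one large ball, which is what makes the bidirectional variant attractive on a graph as large and densely linked as Wikipedia.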

Key words: large-scale knowledge repository, semantic similarity, Wikipedia, shortest path, connectivity
