ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2017, Vol. 54 ›› Issue (11): 2576-2585.doi: 10.7544/issn1000-1239.2017.20160578

Previous Articles     Next Articles

Calculate Semantic Similarity Based on Large Scale Knowledge Repository

Zhang Libo1, Sun Yihan2, Luo Tiejian2   

  1. 1(Institute of Software, Chinese Academy of Sciences, Beijing 100190); 2(University of Chinese Academy of Sciences, Beijing 101408)
  • Online:2017-11-01

Abstract: With the continuous growth of the total of human knowledge, semantic analysis on the basis of the structured big data generated by human is becoming more and more important in the application of the fields such as recommended system and information retrieval. It is a key problem to calculate semantic similarity in these fields. Previous studies acquired certain breakthrough by applying large scale knowledge repository, which was represented by Wikipedia, but the path in Wikipedia didn't be fully utilized. In this paper, we summarize and analyze the previous algorithms for evaluating semantic similarity based on Wikipedia. On this foundation, a bilateral shortest paths algorithm is provided, which can evaluate the similarity between words and texts on the basis of the way human beings think, so that it can take full advantage of the path information in the knowledge repository. We extract the hyperlink structure among nodes, whose granularity is finer than that of articles form Wikipedia, then verify the universal connectivity among Wikipedia and evaluate the average shortest path between any two articles. Besides, the presented algorithm evaluates word similarity and text similarity based on the public dataset respectively, and the result indicates the great effect obtained from our algorithm. In the end of the paper, the advantages and disadvantages of proposed algorithm are summed up, and the way to improve follow-up study is proposed.

Key words: large scale knowledge repository, semantic similarity, Wikipedia, shortest path, connectivity

CLC Number: