Acquiring Topical Concept Network from Multiple Web Information Sources

Xu Yan (许焱); Jin Zhi (金芝); Li Ge (李戈); Wei Qiang (魏强)

Abstract

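Abstract (from the Chinese original): Wikipedia provides conceptual descriptions of individual encyclopedia entries and, through its category system, organizes those entries into a concept network. Its broad coverage and effective organization of information have made it a common source for automated knowledge acquisition. However, Wikipedia's own information alone is not sufficient to characterize accurately the relational knowledge among its concepts, which is an important component of symbolic knowledge representation. This paper therefore proposes a method for acquiring a topical concept network from multiple Web information sources. The method builds on the Wikipedia category system and uses a search engine to collect related Web information as a frame of reference for verifying and discovering relational knowledge. By integrating methods from information retrieval, natural language processing, and related fields, it takes a given topical term as the core, extracts a topic-oriented concept network from the concept network corresponding to the Wikipedia category system, and identifies and labels the relations between the concepts in that network. Experiments were carried out on topical terms from different domains. An empirical evaluation of the results shows that the acquired topical concept networks are topic-oriented and that the relational knowledge among their concepts reaches a certain level of precision.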

Abstract: Wikipedia provides conceptual descriptions of specific entries and organizes these entries into a concept category system. It has become a common information source for automatic knowledge acquisition. However, Wikipedia's information alone is not enough for acquiring the relationships between concepts, even though such relationships are an important component of symbolic knowledge representation. Other kinds of information sources are needed for this purpose. Therefore, we propose an approach for acquiring the relationships between concepts from multiple Web information sources. These concept relationships form a topical concept network. The approach proceeds in three steps. First, given a concept, called the topical term, it obtains a group of concepts and the links between them from the Wikipedia category system. The concept group is centered on the topical term according to a measure of relevance. Second, it uses a search engine to collect related Web information as a reference for discovering and verifying the relationships between the concepts in the group, integrating well-established methods from information retrieval and natural language processing. Finally, it produces a topical concept network in which the nodes are the concepts obtained in the first step and the edges are the relationships obtained in the second step. Experiments were conducted on several topical terms from different domains, and the results show the feasibility and effectiveness of the proposed approach.
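The three steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the relevance measure or the relation-identification method, so the sketch assumes hop distance in the category graph as a stand-in for relevance and a lookup table (`TOY_EVIDENCE`) as a stand-in for the Web-evidence step. It covers only the verification of existing links, not the discovery of new relations. All names and data below are hypothetical.

```python
from collections import deque

# Hypothetical (parent, child) category links standing in for the Wikipedia category system.
TOY_LINKS = [
    ("Computer science", "Software engineering"),
    ("Software engineering", "Software testing"),
    ("Software engineering", "Requirements engineering"),
    ("Software testing", "Unit testing"),
    ("Unit testing", "Test doubles"),
]

# Hypothetical relation labels standing in for evidence gathered from Web search results.
TOY_EVIDENCE = {
    ("Software engineering", "Software testing"): "has-subfield",
}


def topical_subgraph(category_links, topical_term, max_hops=2):
    """Step 1 (assumed form): keep concepts within max_hops of the topical term."""
    # Build an undirected adjacency map from the (parent, child) links.
    neighbors = {}
    for parent, child in category_links:
        neighbors.setdefault(parent, set()).add(child)
        neighbors.setdefault(child, set()).add(parent)

    # Breadth-first search outward from the topical term.
    distance = {topical_term: 0}
    queue = deque([topical_term])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in sorted(neighbors.get(node, ())):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    concepts = set(distance)
    links = [(p, c) for p, c in category_links if p in concepts and c in concepts]
    return concepts, links


def toy_label(source, target):
    """Step 2 stand-in: look up a relation label; None means no supporting evidence."""
    return TOY_EVIDENCE.get((source, target))


def label_links(links, label_relation):
    """Step 3: build the network, keeping only links that received a label (an assumption)."""
    network = {}
    for source, target in links:
        label = label_relation(source, target)
        if label is not None:
            network[(source, target)] = label
    return network
```

Calling `topical_subgraph(TOY_LINKS, "Software engineering", max_hops=1)` and passing the resulting links to `label_links(links, toy_label)` yields a small labeled network centered on the topical term. In a real system, `toy_label` would be replaced by whatever retrieval and language-processing pipeline is used to verify relations.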
