ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2017, Vol. 54, Issue (8): 1833-1852. doi: 10.7544/issn1000-1239.2017.20170348

Special Issue: 2017 Special Topic on Frontier Advances in Artificial Intelligence

Corpus Construction for Chinese Discourse Topic via Micro-Topic Scheme

Xi Xuefeng1,2,3, Chu Xiaomin1, Sun Qingying1, Zhou Guodong1   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215000
  2. Department of Computer Science and Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215009
  3. Virtual Reality Key Laboratory of Intelligent Interaction and Application Technology of Suzhou, Suzhou, Jiangsu 215009
  • Online: 2017-08-01

Abstract: Discourse topic structure analysis is a fundamental task in natural language understanding. The lack of large-scale, high-quality discourse corpora suitable for Chinese discourse analysis has seriously restricted research on computational models of discourse topic. To address this problem, we first study the theoretical representation system of Chinese discourse topic structure. Drawing on theme-rheme theory, Rhetorical Structure Theory (RST) for English, the Penn Discourse Treebank (PDTB) scheme, and research on Chinese complex sentences and sentence groups, and taking the characteristics of Chinese into account, we propose a Chinese discourse micro-topic scheme based on theme-rheme theory and construct a representation model of Chinese discourse topic structure based on the topic chain. On this basis, we adopt a top-down, backward-search annotation strategy together with a combined human-machine annotation method to construct the Chinese Discourse Topic Corpus (CDTC). We then carry out a detailed statistical analysis of the CDTC, which contains a total of 500 documents. Compared with the OntoNotes corpus and the generalized topic structure theory, the micro-topic scheme representation model has some theoretical advantages and is consistent with the characteristics of Chinese. Finally, a consistency test shows that the CDTC fully reflects the difficulty of Chinese discourse topic analysis and can support related research.

Key words: discourse topic structure, theme-rheme theory, thematic progression, topic chain, corpus construction
