Negation and Speculation Scope Detection in Chinese (汉语否定与不确定覆盖域检测)
Abstract: Natural language texts contain many negative and speculative expressions. Identifying this information and separating it from affirmative content is important for downstream natural language processing applications such as information extraction, information retrieval, and sentiment analysis. Compared with English, research on negation and speculation scope detection for Chinese remains scarce. We propose a fusion model based on a bidirectional long short-term memory (BiLSTM) network and conditional random fields (CRF), and cast scope detection as a sequence labeling problem: given a negation or speculation keyword, the model identifies the keyword's semantic scope within the sentence. The model uses the LSTM (long short-term memory) network to exploit both forward and backward context, and uses the CRF layer to capture dependencies between output labels; it thereby benefits from the ability of sequential architectures to encode order information and long-range context dependencies effectively. Experiments on the CNeSp corpus confirm the model's effectiveness. On the financial news sub-dataset, negation and speculation scope detection accuracy reaches 79.16% and 76.79%, respectively, improvements of 25.06% and 34.46% over current Chinese scope detection methods based on traditional machine learning.
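To make the sequence labeling formulation more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The two-label tag set (`O` outside the scope, `I` inside), the five-token sentence length, and all scores are assumptions invented for illustration; in a real BiLSTM-CRF the emission scores would come from the BiLSTM and the transition scores would be learned by the CRF layer. The sketch only shows how Viterbi decoding over emission plus transition scores can yield a contiguous scope where independent per-token decisions leave a gap.

```python
from typing import List, Sequence

# Hypothetical tag set: "O" = outside the scope, "I" = inside the scope.
LABELS = ["O", "I"]


def scope_to_tags(n_tokens: int, start: int, end: int) -> List[str]:
    """Encode a scope span [start, end) as one tag per token."""
    return ["I" if start <= i < end else "O" for i in range(n_tokens)]


def viterbi(emissions: Sequence[Sequence[float]],
            transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring sequence of label indices.

    emissions[t][k]   : score of label k at token t (a BiLSTM would supply these)
    transitions[j][k] : score of moving from label j to label k (the CRF part)
    """
    n_labels = len(transitions)
    score = list(emissions[0])
    backptr = []
    for t in range(1, len(emissions)):
        new_score, ptrs = [], []
        for k in range(n_labels):
            # Best previous label for ending at label k on token t.
            best_j = max(range(n_labels),
                         key=lambda j: score[j] + transitions[j][k])
            ptrs.append(best_j)
            new_score.append(score[best_j] + transitions[best_j][k]
                             + emissions[t][k])
        score = new_score
        backptr.append(ptrs)
    # Follow the back-pointers from the best final label.
    best = max(range(n_labels), key=lambda k: score[k])
    path = [best]
    for ptrs in reversed(backptr):
        best = ptrs[best]
        path.append(best)
    path.reverse()
    return path


# Toy scores for a five-token sentence (columns: O, I). Invented values.
emissions = [[2.0, 0.0], [0.0, 2.5], [0.6, 0.4], [0.0, 2.0], [0.0, 1.5]]
# Leaving the scope (I -> O) is penalized, so decoded scopes stay contiguous.
transitions = [[0.5, 0.0],
               [-1.0, 0.5]]

# Per-token argmax ignores label dependencies.
greedy = [LABELS[max(range(len(LABELS)), key=lambda k: row[k])]
          for row in emissions]
# Viterbi decoding takes the transition scores into account.
decoded = [LABELS[k] for k in viterbi(emissions, transitions)]

print("greedy :", greedy)
print("viterbi:", decoded)
```

With these toy scores, per-token decoding labels the third token `O` and splits the scope, while Viterbi decoding returns the same tags as `scope_to_tags(5, 1, 5)`, a single span covering tokens 1 to 4.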