基于子词的双层CRFs中文分词

黄德根  焦世斗  周惠巍

Dual-Layer CRFs Based on Subword for Chinese Word Segmentation

Huang Degen, Jiao Shidou, and Zhou Huiwei

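Abstract (translated from the Chinese): A subword-based dual-layer CRFs (conditional random fields) method for Chinese word segmentation is proposed, aimed at resolving segmentation ambiguity and unknown words. The method builds on a subword-based sequence labeling model. The first layer uses a character-based CRFs model to recognize the subwords in the test corpus, which reduces label-spanning errors of subwords and raises the precision of subword recognition. The second layer uses a CRFs model trained for subword-based sequence labeling and is applied to the output of the first layer to produce the segmentation result. The method was tested on the simplified Chinese corpora of SIGHAN Bakeoff 2006, namely UPUC and MSRA, and reached F-scores of 93.3% and 96.1%, respectively. The experiments show that the subword-based dual-layer CRFs model makes more effective use of subwords to improve the accuracy of Chinese word segmentation.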

Abstract: A subword-based dual-layer CRFs (conditional random fields) method for Chinese word segmentation is proposed, which aims to solve the problems of segmentation ambiguity and unknown word recognition. Previous work on CRFs reported that subword-based tagging outperforms character-based tagging in all comparative experiments. However, subword-based tagging often produces errors in which a subword crosses a word boundary. The proposed method builds on subword-based sequence labeling, with the subwords selected by a subword filtering algorithm. The learning process is divided into two parts: one trains the first-layer subword tagging CRF with character-based tagging, and the other trains the second-layer word tagging CRF with subword-based tagging. In the word sequence labeling process, the first layer uses the subword tagging CRFs model to recognize the subwords in the test corpora, which reduces the error rate caused by label spanning; the second layer performs subword-based sequence labeling on the output of the first layer to obtain the final result. The proposed method is evaluated on test data from SIGHAN Bakeoff 2006. F-scores of 93.3% and 96.1% are achieved on the UPUC and MSRA corpora, respectively. The experimental results show that this method achieves state-of-the-art performance on Chinese word segmentation.
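The two-layer labeling process described in the abstract can be illustrated with a minimal Python sketch. It shows only the data flow: characters are tagged and merged into subwords, then subwords are tagged and merged into words. Several things here are assumptions rather than details from the paper: the B/M/E/S tag set, the names `merge_by_tags` and `segment`, and the two toy taggers, which return fixed tags and merely stand in for trained CRF models. The subword filtering algorithm and CRF training are not reproduced.

```python
from typing import Callable, List, Sequence

# A tagger maps a sequence of units to one tag per unit.
# Assumed tag set (not stated in the abstract): B = begin, M = middle,
# E = end, S = single-unit item.
Tagger = Callable[[Sequence[str]], List[str]]


def merge_by_tags(units: Sequence[str], tags: Sequence[str]) -> List[str]:
    """Concatenate consecutive units into larger items, closing an item on E or S."""
    if len(units) != len(tags):
        raise ValueError("units and tags must have the same length")
    merged: List[str] = []
    buffer = ""
    for unit, tag in zip(units, tags):
        buffer += unit
        if tag in ("E", "S"):
            merged.append(buffer)
            buffer = ""
    if buffer:  # keep a trailing item even if the tagger never closed it
        merged.append(buffer)
    return merged


def segment(sentence: str, char_tagger: Tagger, subword_tagger: Tagger) -> List[str]:
    """Layer 1: characters -> subwords. Layer 2: subwords -> words."""
    chars = list(sentence)
    subwords = merge_by_tags(chars, char_tagger(chars))
    return merge_by_tags(subwords, subword_tagger(subwords))


# Hypothetical stand-ins for the two trained CRF models (fixed outputs).
def toy_char_tagger(chars: Sequence[str]) -> List[str]:
    return ["B", "E", "B", "E", "B", "E"]


def toy_subword_tagger(subwords: Sequence[str]) -> List[str]:
    return ["B", "E", "S"]


if __name__ == "__main__":
    print(segment("中文分词方法", toy_char_tagger, toy_subword_tagger))
```

In this toy run the first layer groups the six characters into the subwords 中文, 分词, and 方法, and the second layer joins the first two into one word. In a real system each toy tagger would be replaced by a CRF model trained on the corresponding layer's labeled data.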
