Abstract:
A subword-based dual-layer CRF (conditional random field) method for Chinese word segmentation is proposed, which aims to solve the problems of segmentation disambiguation and unknown-word recognition. Previous work on CRFs reported that subword-based tagging outperforms character-based tagging in all comparative experiments. However, subword-based tagging often produces errors that cross word boundaries. The proposed method builds on subword-based sequence labeling, with the subwords selected by a subword filtering algorithm. Learning is divided into two stages: the first trains the first-layer subword tagging CRF with character-based tagging, and the second trains the second-layer word tagging CRF with subword-based tagging. During word sequence labeling, the first layer uses the subword tagging CRF model to recognize the subwords in the test corpora, which reduces the error rate caused by label spanning; the second layer then performs subword-based sequence labeling on the output of the first layer to produce the final result. The proposed method is evaluated on test data from SIGHAN Bakeoff 2006. F-scores of 93.3% and 96.1% are achieved on the UPUC and MSRA corpora, respectively. The experimental results show that this method achieves state-of-the-art performance on Chinese word segmentation.
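The two-layer labeling flow described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the B/M/E/S tag set, the helper name `merge_by_tags`, the example sentence, and the hard-coded tag sequences are all assumptions made for illustration, and the tags stand in for what trained first-layer and second-layer CRF models might predict.

```python
from typing import List


def merge_by_tags(units: List[str], tags: List[str]) -> List[str]:
    """Merge units into larger segments using B/M/E/S tags.

    The B/M/E/S tag set (begin, middle, end, single) is an assumption;
    the abstract does not state which tag set the method uses.
    """
    if len(units) != len(tags):
        raise ValueError("units and tags must have the same length")
    segments: List[str] = []
    current = ""
    for unit, tag in zip(units, tags):
        current += unit
        if tag in ("E", "S"):  # a segment closes on E or S
            segments.append(current)
            current = ""
    if current:  # tolerate a sequence that ends mid-segment
        segments.append(current)
    return segments


# Layer 1: character-level tags (hypothetical first-layer output) -> subwords.
chars = list("我们在北京大学")
layer1_tags = ["B", "E", "S", "B", "E", "B", "E"]
subwords = merge_by_tags(chars, layer1_tags)

# Layer 2: subword-level tags (hypothetical second-layer output) -> words.
layer2_tags = ["S", "S", "B", "E"]
words = merge_by_tags(subwords, layer2_tags)

print(subwords)  # ['我们', '在', '北京', '大学']
print(words)     # ['我们', '在', '北京大学']
```

In this sketch the second layer only ever joins units that the first layer has already produced, which is why the quality of the recognized subwords matters for the final segmentation.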