Abstract:
In the Chinese chunking scheme, words are first combined into base chunks, base chunks are then combined into functional chunks, and the result is finally formalized as a hierarchical syntactic structure. This paper models the automatic labeling of Chinese functional chunks as a sequence labeling task, using either words or base chunks as the labeling units. For each labeling model, a series of new base-chunk-level features is constructed, and a conditional random field (CRF) model is used for labeling. The data set is drawn from the Tsinghua Chinese Treebank (TCT) corpus and split into training and test sets at a ratio of 8:2. Experimental results show that, compared with a model that uses only word-level features, adding some base-chunk features significantly improves functional chunk labeling performance. With human-corrected base chunks, the proposed method achieves a precision of 88.47%, a recall of 89.93%, and an F-measure of 89.19%. With automatically parsed base chunks, it achieves a precision of 84.27%, a recall of 85.57%, and an F-measure of 84.92%.
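The reported F-measures can be checked against the reported precision and recall. The sketch below assumes the F-measure is the balanced F1 score, i.e. the harmonic mean of precision and recall; the abstract does not state the weighting, so this is an assumption, and the function name `f_measure` is purely illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall.

    Assumption: the abstract's "F-measure" is the balanced F1 score.
    Inputs and output are percentages.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) as given in the abstract.
reported = {
    "human-corrected base chunks": (88.47, 89.93, 89.19),
    "auto-parsed base chunks": (84.27, 85.57, 84.92),
}

for setting, (p, r, f_reported) in reported.items():
    f_computed = f_measure(p, r)
    # The inputs are already rounded to two decimals, so allow a small tolerance.
    consistent = abs(f_computed - f_reported) < 0.01
    print(f"{setting}: computed F = {f_computed:.2f}, "
          f"reported F = {f_reported:.2f}, consistent = {consistent}")
```

Under this assumption, both reported F-measures agree with the reported precision and recall to within rounding.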