Automatic Labeling of Chinese Functional Chunks Based on Conditional Random Fields Model

Li Guochen (李国臣); Wang Ruibo (王瑞波); Li Jihong (李济洪)

Abstract

In Chinese chunking, the words of a sentence are first combined into base chunks, the base chunks are then combined into functional chunks, and the result is a hierarchical description of the sentence's syntactic structure. This paper treats the automatic labeling of Chinese functional chunks as a sequence labeling task and builds separate labeling models that use words and base chunks, respectively, as the labeling units. For each labeling model, a feature set at the base-chunk level is constructed, and a conditional random fields (CRF) model is used to label the functional chunks. The experimental data come from the Tsinghua Chinese Treebank (TCT) corpus and are split into a training set and a test set at a ratio of 8:2. The results show that, compared with a labeling model that uses only word-level information, adding suitable base-chunk features significantly improves functional chunk labeling performance. With manually annotated base chunks, functional chunk labeling achieves a precision of 88.47%, a recall of 89.93%, and an F-measure of 89.19%. With automatically labeled base chunks, it achieves a precision of 84.27%, a recall of 85.57%, and an F-measure of 84.92%.
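The reported F-measures can be checked against the reported precision and recall. The sketch below assumes the F-measure is the balanced (β = 1) harmonic mean of precision and recall, which the abstract does not state explicitly; the function name `f_measure` is illustrative and not taken from the paper.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall.

    Assumption: beta = 1 (the abstract does not specify the weighting).
    Inputs and output share the same unit (here, percent).
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (percent).
manual_base_chunks = f_measure(88.47, 89.93)  # manually annotated base chunks
auto_base_chunks = f_measure(84.27, 85.57)    # automatically labeled base chunks

print(f"{manual_base_chunks:.2f}")
print(f"{auto_base_chunks:.2f}")
```

Under this assumption, both computed values round to the F-measures given in the abstract (89.19 and 84.92), so the three reported figures in each setting are mutually consistent.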
