    赵 岩 王晓龙 刘秉权 关 毅. 融合聚类触发对特征的最大熵词性标注模型[J]. 计算机研究与发展, 2006, 43(2): 268-274.
    Zhao Yan, Wang Xiaolong, Liu Bingquan, and Guan Yi. Fusion of Clustering Trigger-Pair Features for POS Tagging Based on Maximum Entropy Model[J]. Journal of Computer Research and Development, 2006, 43(2): 268-274.

    融合聚类触发对特征的最大熵词性标注模型

    Fusion of Clustering Trigger-Pair Features for POS Tagging Based on Maximum Entropy Model

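    • Abstract (translated from the Chinese 摘要): To address the inability of the traditional HMM part-of-speech tagging model to incorporate long-distance lexical features, trigger pairs of the form “W_A→W_B?T_B” are proposed to carry long-distance lexical information, and average mutual information is used as the measure for selecting trigger-pair features. Within the maximum entropy framework, the selected trigger-pair features are added to the POS tagging system. Words are clustered using the semantic similarity computation provided by the vector space model, and the clustering results are merged with a semantic dictionary to build clustering trigger-pair features, which address the data sparseness of the trigger word “W_A”. Experimental results show that, compared with the HMM, the maximum entropy model with clustering trigger-pair features reduces the tagging error rate by 34%.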

       

      Abstract: Part-of-speech (POS) information is required before more complex analysis can be carried out. Traditional POS taggers are based on the hidden Markov model (HMM); however, the HMM cannot include long-distance lexical features that help predict the correct POS. To solve this problem, a trigger pair of the form “W_A→W_B?T_B”, which carries long-distance lexical information, is first proposed, and a better correlation measure, average mutual information (AMI), is used instead of mutual information (MI) to extract trigger pairs from the training corpus. To cope with the sparseness of the trigger word “W_A”, words are clustered using the semantic similarity calculation provided by the vector space model, and clustering trigger pairs are built from the result. Finally, the high-quality clustering trigger pairs are added to the POS tagging system as a new kind of feature under the maximum entropy framework. Experiments show that the tagging error of the new model is reduced by 34% compared with the HMM. The idea can also be applied to Pinyin-to-character conversion and word sense disambiguation.
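      The abstract names average mutual information (AMI) as the measure for selecting trigger pairs but does not give a formula. The sketch below uses the common definition of AMI from trigger-pair language modeling, that is, the mutual information between the binary events "trigger A occurs" and "target B occurs", estimated from co-occurrence counts. Whether the paper uses exactly this form is an assumption, and the function name, the count arguments, and the notion of a "window" are all hypothetical.

```python
import math


def average_mutual_information(n_ab, n_a, n_b, n):
    """AMI in bits between binary events A and B, estimated from counts.

    n_ab: number of windows in which both A and B occur
    n_a:  number of windows in which A occurs
    n_b:  number of windows in which B occurs
    n:    total number of windows
    (All four arguments are illustrative assumptions, not the paper's notation.)
    """
    # Each entry is (joint count, marginal count for A-side, marginal for B-side)
    # for one cell of the 2x2 contingency table.
    cells = [
        (n_ab, n_a, n_b),                          # A and B
        (n_a - n_ab, n_a, n - n_b),                # A and not B
        (n_b - n_ab, n - n_a, n_b),                # not A and B
        (n - n_a - n_b + n_ab, n - n_a, n - n_b),  # not A and not B
    ]
    ami = 0.0
    for joint, marg_x, marg_y in cells:
        if joint > 0:  # the term 0 * log 0 is taken to be 0
            # P(x, y) * log2( P(x, y) / (P(x) * P(y)) )
            ami += (joint / n) * math.log2(joint * n / (marg_x * marg_y))
    return ami
```

      Unlike pointwise MI, which scores only the cell where A and B co-occur and tends to favor rare pairs, this measure sums over all four cells, so frequent and informative pairs rank higher. Candidate pairs could then be sorted by this score and the top ones kept as features.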

       
