Fusion of Clustering Trigger-Pair Features for POS Tagging Based on Maximum Entropy Model
-
Abstract
Part-of-speech (POS) information is required before more complex analysis can be carried out. Traditional POS taggers are based on the hidden Markov model (HMM); however, the HMM cannot incorporate the long-distance lexical features that help predict the correct POS. To solve this problem, a “W_A→W_B/T_B” trigger pair, which carries long-distance lexical information, is first proposed. Average mutual information (AMI), a better correlation measure than mutual information (MI), is then used to extract trigger pairs from the training corpus. To cope with the sparseness of the trigger word “W_A”, words are clustered by semantic similarity computed with the vector space model, which yields clustering trigger-pairs. Finally, the high-quality clustering trigger-pairs are added to the POS tagging system as a new kind of feature under the maximum entropy framework. Experiments show that the new model reduces the tagging error by 34% compared with the HMM. The idea of the paper can also be applied to Pinyin-to-character conversion and word sense disambiguation.
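As an illustration of the AMI measure mentioned above, the following minimal sketch computes the average mutual information between two binary events from co-occurrence counts. It assumes the standard four-term definition of AMI over the events A (the trigger word occurs in the history) and B (the triggered word/tag pair occurs at the current position); the abstract does not state the exact formula used, and the function name, the count arguments, and the choice of base-2 logarithms are hypothetical.

```python
import math


def average_mutual_information(n_ab, n_a, n_b, n):
    """AMI (in bits) between binary events A and B, from counts.

    Assumed (hypothetical) inputs:
      n_ab: number of positions where both A and B hold
      n_a:  number of positions where A holds
      n_b:  number of positions where B holds
      n:    total number of positions
    """
    # Each cell of the 2x2 contingency table: (joint count, A-margin, B-margin).
    cells = [
        (n_ab, n_a, n_b),                          # A and B
        (n_a - n_ab, n_a, n - n_b),                # A and not B
        (n_b - n_ab, n - n_a, n_b),                # not A and B
        (n - n_a - n_b + n_ab, n - n_a, n - n_b),  # not A and not B
    ]
    total = 0.0
    for joint, margin_a, margin_b in cells:
        if joint == 0:
            continue  # a cell with zero probability contributes nothing
        # P(x, y) * log2( P(x, y) / (P(x) * P(y)) )
        total += (joint / n) * math.log2(joint * n / (margin_a * margin_b))
    return total
```

Unlike plain MI, which scores only the cell where both events occur, this sum also counts the evidence from the cases where one or both events are absent, so a rare pair with a high ratio but little total information scores low. Candidate trigger pairs could then be ranked by this value and the top-scoring ones kept.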