Ensembles of Classifiers for Chinese Word Sense Disambiguation

Wu Yunfang (吴云芳); Wang Miao (王淼); Jin Peng (金澎); Yu Shiwen (俞士汶)

Abstract

Word sense disambiguation (WSD) has long been both a central concern and a hard problem in natural language processing, and ensemble methods are regarded as one of the four major directions in machine learning research. This paper presents a systematic study of nine ensemble methods applied to Chinese WSD: product rule, average, max, min, majority voting, rank-based voting, weighted voting, weighted probability, and best single classifier combining. Of these, product, average, and max had not previously been applied to WSD. Support vector machine (SVM), naive Bayes, and decision tree are selected as the three component classifiers. All three classifiers use the same four kinds of features: bag of words, words with position, parts of speech with position, and 2-gram collocations. Experiments are conducted on two datasets: the first consists of 18 ambiguous words selected from a sense-annotated corpus of contemporary Chinese, and the second is the Multilingual Chinese-English Lexical Sample task of SemEval-2007. The results show that product, average, and max, introduced to WSD for the first time here, perform well: all three achieve higher disambiguation accuracy than the best single classifier, SVM, and also outperform the other six ensemble methods.
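To make the fixed combining rules named in the abstract concrete, the following is a minimal sketch of product, average, max, min, and majority voting over per-classifier sense probabilities. It is an illustration under stated assumptions, not the authors' implementation: the function name `combine`, the tie-breaking behavior (first sense wins), and the probability values for the three classifiers are all hypothetical, and the rank-based, weighted, and best-single strategies are omitted.

```python
from collections import Counter
from math import prod


def combine(probs, rule):
    """Return the index of the sense chosen by a combining rule.

    probs: one probability list per classifier, all over the same senses.
    rule:  "product", "average", "max", "min", or "majority".
    Ties go to the lowest sense index (an assumption of this sketch).
    """
    n_senses = len(probs[0])
    if rule == "majority":
        # Each classifier votes for its own most probable sense.
        votes = Counter(max(range(n_senses), key=p.__getitem__) for p in probs)
        return votes.most_common(1)[0][0]
    reducers = {
        "product": prod,
        "average": lambda xs: sum(xs) / len(xs),
        "max": max,
        "min": min,
    }
    reduce_fn = reducers[rule]
    # Score each sense by reducing its probabilities across classifiers.
    scores = [reduce_fn([p[c] for p in probs]) for c in range(n_senses)]
    return max(range(n_senses), key=scores.__getitem__)


# Hypothetical outputs of three classifiers for one word with three senses.
svm_probs = [0.5, 0.4, 0.1]
nb_probs = [0.2, 0.7, 0.1]
dt_probs = [0.6, 0.3, 0.1]
all_probs = [svm_probs, nb_probs, dt_probs]

for rule in ("product", "average", "max", "min", "majority"):
    print(rule, combine(all_probs, rule))
```

With these made-up numbers, two of the three classifiers prefer sense 0, so majority voting picks sense 0, while the four probability-based rules pick sense 1 because one classifier is strongly confident in it. This shows how the rules can disagree on the same inputs; it says nothing about which rule is more accurate in practice.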
