ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 (Journal of Computer Research and Development) ›› 2005, Vol. 42 ›› Issue (1): 94-101.

使用最大熵模型进行中文文本分类

李荣陆 王建会 陈晓云 陶晓鹏 胡运发   

  1. (Department of Computing and Information Technology, Fudan University, Shanghai 200433) (lironglu@163.net)
  • Published: 2005-01-15

Using Maximum Entropy Model for Chinese Text Categorization

Li Ronglu, Wang Jianhui, Chen Xiaoyun, Tao Xiaopeng, and Hu Yunfa   

  1. (Department of Computing and Information Technology, Fudan University, Shanghai 200433)
  • Online: 2005-01-15

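Abstract: With the rapid growth of the WWW, text classification has become a key technology for processing and organizing large volumes of document data. Because the maximum entropy model can integrate diverse pieces of probabilistic knowledge, whether related or unrelated, it achieves good results on many problems. However, research applying the maximum entropy model to text classification is scarce, and no study using it for Chinese text classification has yet been reported. This paper applies the maximum entropy model to Chinese text classification. Experiments compare and analyze the classification performance of the maximum-entropy-based classifier under different methods of generating Chinese text features, different numbers of features, and with a smoothing technique applied. The classifier is also compared with three typical text classifiers, Bayes, KNN, and SVM. The results show that its classification performance exceeds that of the Bayes method and is comparable to that of KNN and SVM, indicating that it is a very promising text classification method.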

Keywords: text classification, maximum entropy model, features, N-Gram

Abstract: With the rapid development of the World Wide Web, text classification has become a key technology for organizing and processing large amounts of document data. The maximum entropy model is a probability estimation technique widely used for a variety of natural language tasks. It offers a clean and flexible framework for combining diverse pieces of contextual information to estimate the probability of a given linguistic phenomenon. On many NLP tasks this approach performs at a near state-of-the-art level, or outperforms competing probabilistic methods when trained and tested under similar conditions. However, relatively little work has applied the maximum entropy model to text categorization, and no previous work has used it to classify Chinese documents. In this paper, the maximum entropy model is used for Chinese text categorization. Its categorization performance is compared and analyzed across different approaches to text feature generation, different numbers of features, and with a smoothing technique applied. Moreover, in experiments it is compared with Bayes, KNN, and SVM classifiers; its performance is higher than that of Bayes and comparable with that of KNN and SVM. These results suggest that it is a promising technique for text categorization.

Key words: text classification, maximum entropy model, features, N-Gram
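
To make the setting in the abstract concrete, the following is a minimal sketch of a maximum entropy (conditional exponential) classifier over character N-gram counts. It is an illustration under stated assumptions, not the authors' implementation: the function names (`char_ngrams`, `predict_proba`, `train_maxent`, `classify`), the four toy documents, and the hyperparameters are all hypothetical. Plain gradient ascent with an L2 penalty stands in for the training procedure and smoothing technique, neither of which the abstract specifies.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=2):
    """Count overlapping character n-grams (no word segmentation needed)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def predict_proba(weights, classes, feats):
    """p(c | d) = exp(sum_g w[c][g] * count(g)) / Z(d)."""
    scores = {c: sum(weights[c].get(g, 0.0) * v for g, v in feats.items())
              for c in classes}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def train_maxent(docs, labels, n=2, epochs=200, lr=0.5, l2=0.01):
    """Fit weights by batch gradient ascent on the penalised log-likelihood.

    Assumption: the L2 penalty is a generic regulariser chosen for this
    sketch, not the smoothing method evaluated in the paper.
    """
    classes = sorted(set(labels))
    feats = [char_ngrams(d, n) for d in docs]
    weights = {c: defaultdict(float) for c in classes}
    for _ in range(epochs):
        grad = {c: defaultdict(float) for c in classes}
        for f, y in zip(feats, labels):
            p = predict_proba(weights, classes, f)
            for c in classes:
                # observed feature count minus expected count under the model
                coeff = (1.0 if c == y else 0.0) - p[c]
                for g, v in f.items():
                    grad[c][g] += coeff * v
        for c in classes:
            for g, gval in grad[c].items():
                weights[c][g] += lr * (gval / len(docs) - l2 * weights[c][g])
    return weights, classes

def classify(weights, classes, text, n=2):
    """Return the most probable class for a piece of text."""
    probs = predict_proba(weights, classes, char_ngrams(text, n))
    return max(probs, key=probs.get)

# Toy data invented for illustration only.
docs = ["股票市场上涨", "股票基金下跌", "足球比赛获胜", "篮球比赛失利"]
labels = ["finance", "finance", "sports", "sports"]
weights, classes = train_maxent(docs, labels)
print(classify(weights, classes, "股票上涨"))
```

Character bigrams are used here because they sidestep Chinese word segmentation; swapping `char_ngrams` for a word-based extractor would give a rough way to compare feature generation methods in the spirit of the experiments described above.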