ISSN 1000-1239 CN 11-1777/TP

• •

### 使用最大熵模型进行中文文本分类

1. (复旦大学计算机与信息技术系 上海 200433) (lironglu@163.net)
• 出版日期: 2005-01-15

### Using Maximum Entropy Model for Chinese Text Categorization

Li Ronglu, Wang Jianhui, Chen Xiaoyun, Tao Xiaopeng, and Hu Yunfa

1. (Department of Computing and Information Technology, Fudan University, Shanghai 200433)
• Online: 2005-01-15

Abstract: With the rapid development of World Wide Web, text classification has become the key technology in organizing and processing large amount of document data. Maximum entropy model is a probability estimation technique widely used for a variety of natural language tasks. It offers a clean and accommodable frame to combine diverse pieces of contextual information to estimate the probability of a certain linguistics phenomena. This approach for many tasks of NLP perform near state-of-the-art level, or outperform other competing probability methods when trained and tested under similar conditions. However, relatively little work has been done on applying maximum entropy model to text categorization problems. In addition, no previous work has focused on using maximum entropy model in classifying Chinese documents. Maximum entropy model is used for text categorization. Its categorization performance is compared and analyzed using different approaches for text feature generation, different number of feature and smoothng technique. Moreover, in experiments it is compared to Bayes, KNN and SVM, and it is shown that its performance is higher than Bayes and comparable with KNN and SVM. It is a promising technique for text categorization.