ISSN 1000-1239 CN 11-1777/TP

• Paper •

### A Simple and Efficient Algorithm to Classify a Large Scale of Texts

Wang Jianhui1, Wang Hongwei1,2, Shen Zhan1, and Hu Yunfa1

1 (Department of Computing and Information Technology, Fudan University, Shanghai 200433); 2 (School of Economics and Management, Tongji University, Shanghai 20433)
• Online: 2005-01-15

Abstract: Most current text classification methods are based on the vector space model (VSM), and k-nearest neighbors (kNN) is a widely used example. However, most of these methods are computationally expensive and cannot be applied when a large number of samples must be classified. Moreover, the classifier must be rebuilt whenever the training corpus is extended, so these methods scale poorly. This paper puts forward two new concepts, mutual dependence (MD) and equivalent radius (ER), and proposes a new classification method, SECTILE. SECTILE can classify a large number of samples and scales well. SECTILE is then applied to the classification of Chinese documents and compared with kNN and the CCC method. The results show that SECTILE outperforms both kNN and CCC, and that it can be used online to classify a large number of samples while maintaining classification precision and recall.

Key words: classification, MD, ER, VSM, SECTILE
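
For context on the baseline the abstract compares against, the following is a minimal sketch of kNN classification over a vector space model. It is not SECTILE, whose MD and ER definitions are not given in the abstract. The term-frequency weighting, the cosine similarity measure, the similarity-weighted vote, and all names (`to_vector`, `cosine`, `knn_classify`, the toy `training_set`) are assumptions made for illustration. The sketch shows why this baseline is costly at scale: every query is compared against every training document.

```python
import math
from collections import Counter


def to_vector(tokens):
    """Map a token list to a sparse term-frequency vector (a point in the VSM)."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_classify(query_tokens, training_set, k=3):
    """Label a query by a similarity-weighted vote of its k nearest neighbors.

    Assumes training_set is a non-empty list of (tokens, label) pairs.
    The query is scored against every training document, so the cost of
    one classification grows linearly with the size of the corpus.
    """
    query = to_vector(query_tokens)
    scored = [(cosine(query, to_vector(tokens)), label)
              for tokens, label in training_set]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]


# Hypothetical toy corpus, for illustration only.
training_set = [
    (["stock", "market", "shares"], "finance"),
    (["bank", "market", "loan"], "finance"),
    (["match", "goal", "team"], "sports"),
    (["team", "coach", "season"], "sports"),
]
```

Calling `knn_classify(["market", "loan", "shares"], training_set)` scans all four training documents before voting. Because kNN is a lazy learner, the stored vectors are the model, which is why classification time, rather than training time, dominates as the corpus grows.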