    Citation: Wang Jianhui, Wang Hongwei, Shen Zhan, Hu Yunfa. A Simple and Efficient Algorithm to Classify a Large Scale of Texts[J]. Journal of Computer Research and Development, 2005, 42(1): 85-93.

    A Simple and Efficient Algorithm to Classify a Large Scale of Texts

    • Abstract: Most existing classification algorithms in pattern recognition are based on the vector space model (VSM), and the most widely used of them is kNN (k-nearest neighbors). Most of these algorithms, however, have a computational complexity too high for large-scale use. They also scale poorly, because the classifier must be rebuilt whenever the training set grows. To address both problems, this paper introduces two concepts, mutual dependence (MD) and equivalent radius (ER), and combines them into a new, easily updated classification algorithm, SECTILE. SECTILE has relatively low computational complexity and good scalability, which makes it suitable for large-scale settings. SECTILE is applied to Chinese text classification and compared with kNN and the class center vector method (CCC). The results show that SECTILE outperforms both: it improves classification precision while greatly increasing classification speed, which supports real-time, online automatic classification of large sample collections.
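The abstract does not define mutual dependence or equivalent radius, so SECTILE itself is not sketched here. As a rough illustration of the two baselines it is compared against, the Python sketch below shows why kNN is costly at scale (every query is compared with every training vector) and why a class center vector classifier is cheaper and easier to update (one comparison per class, and new samples are simply added to a running sum). All names (`cosine`, `knn_classify`, `CentroidClassifier`) and the toy data are assumptions made for illustration, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def knn_classify(query, training, k=3):
    """kNN baseline: one similarity computation per training sample."""
    ranked = sorted(training, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


class CentroidClassifier:
    """Class center vector baseline: one similarity computation per class."""

    def __init__(self):
        # Per-class sum of term weights. Cosine similarity ignores vector
        # length, so the sum gives the same ranking as the mean (the center).
        self.sums = defaultdict(lambda: defaultdict(float))
        self.counts = Counter()

    def add(self, vector, label):
        """Incremental update: no rebuild is needed when a sample arrives."""
        for term, weight in vector.items():
            self.sums[label][term] += weight
        self.counts[label] += 1

    def classify(self, query):
        return max(self.sums, key=lambda label: cosine(query, self.sums[label]))


# Hypothetical toy corpus of term-weight vectors.
training = [
    ({"goal": 1.0, "match": 1.0}, "sports"),
    ({"team": 1.0, "match": 1.0}, "sports"),
    ({"stock": 1.0, "market": 1.0}, "finance"),
    ({"bank": 1.0, "market": 1.0}, "finance"),
]

centroids = CentroidClassifier()
for vec, label in training:
    centroids.add(vec, label)

query = {"team": 1.0, "match": 1.0}
print(knn_classify(query, training, k=3))
print(centroids.classify(query))
```

With N training documents spread over C classes, the kNN routine performs N similarity computations per query and the centroid routine performs C. That gap is the kind of cost difference the abstract points to when it calls kNN unsuitable for large-scale use.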

       
