
    A Transductive Multi-Label Text Categorization Approach

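    • Summary: Real-world documents often belong to several categories at once, so classifying documents with multi-label learning techniques is an important research direction. Existing multi-label document classification methods need a large number of correctly labeled documents to classify well, yet in practice only a small number of labeled documents are usually available as training documents. Motivated by the idea of exploiting unlabeled documents, this paper proposes a random-walk-based transductive multi-label document classification method that can use a large number of unlabeled documents to help improve classification performance. Experimental results show that the method outperforms CNMF, an existing transductive multi-label classification method.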

       

      Abstract: Real-world text documents usually belong to multiple classes simultaneously, so using multi-label learning techniques to classify text documents is an important research direction. Existing multi-label text categorization approaches usually require a large number of documents with correct class labels to achieve good performance. In real applications, however, often only a small number of labeled documents can be obtained as training samples because human and material resources are limited. Since large numbers of unlabeled documents can be readily obtained, exploiting them automatically is a basic motivation of this work. Random walk is a popular technique in semi-supervised learning as well as in transductive learning. In this paper, the authors propose a random-walk-based transductive multi-label text categorization approach that is able to exploit abundant unlabeled documents to help improve classification performance. In the proposed approach, labels are spread from the labeled documents to the unlabeled documents, so a small number of labeled documents and a large number of unlabeled documents are used simultaneously in the learning process. Experimental results show that the proposed approach performs better than the existing semi-supervised multi-label method CNMF (constrained non-negative matrix factorization).
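The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of spreading labels over a document graph with a random walk. It is a generic clamped label-propagation scheme, not the authors' method. The function names `propagate_labels` and `predict`, the similarity weights, the 0.5 threshold, and the toy data are all assumptions made for illustration.

```python
def propagate_labels(similarity, known_labels, num_labels, iterations=50):
    """Spread label scores over a document graph with a clamped random walk.

    similarity   -- n x n list of non-negative edge weights between documents
    known_labels -- dict mapping a labeled document index to its set of label ids
    num_labels   -- total number of distinct labels
    Returns an n x num_labels list of scores in [0, 1].
    """
    n = len(similarity)

    # Row-normalise the weights into random-walk transition probabilities.
    transition = []
    for row in similarity:
        total = sum(row)
        transition.append([w / total if total > 0 else 0.0 for w in row])

    def clamp(scores):
        # Labeled documents always keep their known labels.
        for doc, labels in known_labels.items():
            scores[doc] = [1.0 if k in labels else 0.0 for k in range(num_labels)]
        return scores

    scores = clamp([[0.0] * num_labels for _ in range(n)])
    for _ in range(iterations):
        # Each document takes the weighted average of its neighbours' scores.
        updated = [
            [sum(transition[i][j] * scores[j][k] for j in range(n))
             for k in range(num_labels)]
            for i in range(n)
        ]
        scores = clamp(updated)
    return scores


def predict(scores, threshold=0.5):
    """Assign every label whose score reaches the (assumed) threshold."""
    return [{k for k, s in enumerate(row) if s >= threshold} for row in scores]


# Toy graph (hypothetical): documents 0 and 1 are similar, as are 2 and 3,
# with only a weak link between documents 1 and 2.
similarity = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 0.1, 0.0],
    [0.0, 0.1, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
known = {0: {0, 2}, 3: {1}}  # document 0 carries two labels at once
scores = propagate_labels(similarity, known, num_labels=3)
predicted = predict(scores)
print(predicted)
```

In this toy run the unlabeled document 1 inherits both labels of its close neighbour, document 0, which is the multi-label behaviour the abstract describes. A real system would build `similarity` from document features such as TF-IDF cosine similarity, which is also an assumption here.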

       
