    Text Representation Using a Domain Dictionary

    • Abstract: To improve text categorization performance, this paper proposes a text feature representation that combines machine learning with a domain dictionary. Representing text by domain-dictionary features strengthens the expressiveness of the features and reduces the dimensionality of the feature space. A domain dictionary has limited coverage, however: some words do not appear in it. To address this, a learning model called the self-partition model is proposed, which automatically maps words to domain features. A text categorization system is then built that uses these learned domain features as text features. Experimental results show that the approach improves categorization performance and is especially effective when the feature set is small. With 500 features, its F1 is 6.58% higher than that of a system based on conventional word (bag-of-words, BOW) features.
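The idea of replacing word features with domain features can be illustrated with a minimal sketch. The toy dictionary, the domain labels, and the function names below are hypothetical and are not taken from the paper. The sketch only shows how a dictionary lookup shrinks the feature space and where the coverage gap appears. It does not implement the self-partition model, whose details the abstract does not give.

```python
from collections import Counter

# Hypothetical toy domain dictionary (word -> domain feature); not from the paper.
DOMAIN_DICT = {
    "goalkeeper": "sports",
    "striker": "sports",
    "league": "sports",
    "stock": "finance",
    "dividend": "finance",
    "bond": "finance",
}


def bow_features(tokens):
    """Bag-of-words baseline: one feature per distinct word."""
    return Counter(tokens)


def domain_features(tokens, domain_dict):
    """Count domain features and collect the words the dictionary does not cover."""
    counts = Counter()
    uncovered = []
    for tok in tokens:
        domain = domain_dict.get(tok)
        if domain is None:
            # Coverage gap: the abstract's self-partition model is meant to
            # handle such words; this sketch simply records them.
            uncovered.append(tok)
        else:
            counts[domain] += 1
    return counts, uncovered


tokens = "the striker beat the goalkeeper as league stock prices fell".split()
bow = bow_features(tokens)
dom, uncovered = domain_features(tokens, DOMAIN_DICT)

print(len(bow), "word features vs", len(dom), "domain features")
print("not covered by the dictionary:", uncovered)
```

In this toy sentence, the words found in the dictionary collapse into two domain features, while the remaining words stay unmapped. Those unmapped words are the coverage problem the abstract describes.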

       
