ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development

Previous Articles     Next Articles

A Two-Layer Bayes Model: Random Forest Naive Bayes

Zhang Wenjun1, Jiang Liangxiao1,2 , Zhang Huan1,  Chen Long1   

  1. 1School of Computer Science, China University of Geosciences, Wuhan 430074

    2Hubei Key Laboratory of Intelligent Geo-Information Processing, China University of Geosciences, Wuhan 430074

  • Supported by: 
    The work was supported by the National Natural Science Foundation of China (U1711267) and the Fundamental Research Funds for the Central Universities (CUGGC03).

Abstract: Text classification is an essential task in natural language processing. The high dimension and sparsity of text data bring many problems and challenges to text classification. Naive Bayes (NB) is widely used in text classification due to its simplicity, efficiency and comprehensibility, but its attribute conditional independence assumption is rarely met in real-world text data and thus affects its classification performance. In order to weaken the attribute conditional independence assumption required by NB, scholars have proposed a variety of improved approaches, mainly including structure extension, instance selection, instance weighting, feature selection, and feature weighting. However, all these approaches construct NB classification models based on the independent term features, which restricts their classification performance to a certain extent. In this paper, we try to improve the naive Bayes text classification model by feature learning and thus propose a two-layer Bayes model called random forest naive Bayes (RFNB). RFNB is divided into two layers. In the first layer, random forest (RF) is used to learn high-level features of term combinations from original term features. Then the learned new features are input into the second layer, which is used to construct a Bernoulli naive Bayes model after one-hot encoding. The experimental results on a large number of widely used text datasets show that the proposed RFNB significantly outperforms the existing state-of-the-art naive Bayes text classification models and other classical text classification models.

Key words: naive Bayes, random forest, feature learning, feature representation, text classification