
A Two-Layer Bayes Model: Random Forest Naive Bayes

Zhang Wenjun, Jiang Liangxiao, Zhang Huan, Chen Long

Citation: Zhang Wenjun, Jiang Liangxiao, Zhang Huan, Chen Long. A Two-Layer Bayes Model: Random Forest Naive Bayes[J]. Journal of Computer Research and Development, 2021, 58(9): 2040-2051. DOI: 10.7544/issn1000-1239.2021.20200521

  • CLC number: TP391


Funds: The work was supported by the Joint Fund Key Projects of the National Natural Science Foundation of China (U1711267) and the Fundamental Research Funds for the Central Universities (CUGGC03).
  • Abstract: Text classification is a fundamental task in natural language processing. The high dimensionality and sparsity of text data pose many challenges for it. Naive Bayes (NB) is widely used for text classification because it is simple, efficient, and easy to interpret, but its attribute conditional independence assumption rarely holds in real-world text data, which harms its classification performance. To weaken this assumption, researchers have proposed many improved approaches, mainly structure extension, instance selection, instance weighting, feature selection, and feature weighting. However, all of these approaches build NB classifiers on independent term features, which limits their classification performance to some extent. This paper instead improves the naive Bayes text classification model through feature learning and proposes a two-layer Bayes model called random forest naive Bayes (RFNB). In the first layer, a random forest (RF) learns high-level features that represent term combinations from the original term features. The learned features are then fed into the second layer, where they are one-hot encoded and used to build a Bernoulli naive Bayes model. Experimental results on a large number of widely used text datasets show that RFNB significantly outperforms existing state-of-the-art naive Bayes text classification models and other classical text classification models.
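The two-layer idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state which tree outputs serve as the learned features, so the sketch assumes each tree contributes the leaf that a document lands in. The function names (`grow_tree`, `leaf_path`, `fit_forest`, `encode`, `fit_bernoulli_nb`, `predict`), the toy documents, and all hyperparameters are hypothetical, and only the Python standard library is used.

```python
import math
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a non-empty list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(docs, labels, vocab, rng, depth=0, max_depth=3):
    """Grow one randomized tree. None is a leaf; an internal node is
    (term, subtree_if_absent, subtree_if_present)."""
    if depth == max_depth or len(set(labels)) <= 1:
        return None
    best_term, best_score = None, None
    # Each node only considers a random subset of terms (about sqrt(|vocab|)).
    for term in rng.sample(vocab, max(1, int(math.sqrt(len(vocab))))):
        present = [y for d, y in zip(docs, labels) if term in d]
        absent = [y for d, y in zip(docs, labels) if term not in d]
        if not present or not absent:
            continue
        score = len(present) * gini(present) + len(absent) * gini(absent)
        if best_score is None or score < best_score:
            best_term, best_score = term, score
    if best_term is None:
        return None
    split = {True: ([], []), False: ([], [])}
    for d, y in zip(docs, labels):
        split[best_term in d][0].append(d)
        split[best_term in d][1].append(y)
    return (best_term,
            grow_tree(*split[False], vocab, rng, depth + 1, max_depth),
            grow_tree(*split[True], vocab, rng, depth + 1, max_depth))

def leaf_path(tree, doc):
    """Identify the leaf a document reaches by its root-to-leaf path of 0/1 tests."""
    path = ""
    while tree is not None:
        term, absent, present = tree
        path += "1" if term in doc else "0"
        tree = present if term in doc else absent
    return path

def fit_forest(docs, labels, n_trees, rng):
    """Layer 1: bootstrap-sampled randomized trees over binary term features."""
    vocab = sorted(set().union(*docs))  # sorted so results depend only on the seed
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(docs)) for _ in docs]
        forest.append(grow_tree([docs[i] for i in idx],
                                [labels[i] for i in idx], vocab, rng))
    return forest

def encode(forest, doc):
    """One-hot view of the learned features: the set of active
    (tree index, leaf path) indicators, exactly one per tree."""
    return {(t, leaf_path(tree, doc)) for t, tree in enumerate(forest)}

def fit_bernoulli_nb(encoded, labels):
    """Layer 2: Bernoulli naive Bayes with Laplace-smoothed feature probabilities."""
    features = sorted(set().union(*encoded))
    model = {}
    for c, n_c in Counter(labels).items():
        on = Counter(f for x, y in zip(encoded, labels) if y == c for f in x)
        model[c] = (math.log(n_c / len(labels)),
                    {f: (on[f] + 1) / (n_c + 2) for f in features})
    return model

def predict(model, x):
    """Return the class with the highest log posterior.
    Indicators never seen during training are ignored."""
    def log_posterior(c):
        log_prior, probs = model[c]
        return log_prior + sum(math.log(p if f in x else 1 - p)
                               for f, p in probs.items())
    return max(model, key=log_posterior)

# Toy data: each document is the set of terms it contains.
docs = [{"goal", "match", "team"}, {"team", "coach", "win"}, {"match", "score", "win"},
        {"cpu", "code", "chip"}, {"code", "bug", "software"}, {"chip", "software", "gpu"}]
labels = ["sport"] * 3 + ["tech"] * 3

rng = random.Random(0)
forest = fit_forest(docs, labels, n_trees=25, rng=rng)
encoded = [encode(forest, d) for d in docs]
nb = fit_bernoulli_nb(encoded, labels)
prediction = predict(nb, encode(forest, {"team", "win"}))
```

Because every leaf corresponds to a conjunction of term-presence tests, each one-hot indicator stands for a combination of terms, which is one way to read the "high-level features of term combinations" that the first layer is said to learn.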

Publication history
  • Publication date: 2021-08-31
