A Two-Layer Bayes Model: Random Forest Naive Bayes

Zhang Wenjun (张文钧); Jiang Liangxiao (蒋良孝); Zhang Huan (张欢); Chen Long (陈龙)

doi:10.7544/issn1000-1239.2021.20200521


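¹(School of Computer Science, China University of Geosciences, Wuhan 430074)
²(Hubei Key Laboratory of Intelligent Geo-Information Processing (China University of Geosciences), Wuhan 430074) (wjzhang@cug.edu.cn)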

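Funds: This work was supported by the Joint Fund Key Projects of the National Natural Science Foundation of China (U1711267) and the Fundamental Research Funds for the Central Universities (CUGGC03).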

Details

CLC number: TP391
Publication history
- Published: 2021-08-31


Abstract

Abstract: Text classification is a fundamental task in natural language processing. The high dimensionality and sparsity of text data pose many problems and challenges for text classification. Naive Bayes (NB) is widely used in text classification because it is simple, efficient, and easy to understand, but its attribute conditional independence assumption rarely holds in real-world text data, which harms its classification performance. To weaken this assumption, researchers have proposed many improved approaches, mainly structure extension, instance selection, instance weighting, feature selection, and feature weighting. However, all of these approaches build NB classification models on independent term features, which limits their classification performance to some extent. In this paper, we instead try to improve the naive Bayes text classification model through feature learning, and propose a two-layer Bayes model called random forest naive Bayes (RFNB). In the first layer, a random forest (RF) learns high-level features, each representing a combination of terms, from the original term features. The learned features are then passed to the second layer, where they are one-hot encoded and used to build a Bernoulli naive Bayes model. Experimental results on a large number of widely used text datasets show that the proposed RFNB significantly outperforms existing state-of-the-art naive Bayes text classification models and other classical text classification models.
- naive Bayes (NB)
- random forest
- feature learning
- feature representation
- text classification
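
The two-layer pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the authors' implementation. In particular, the abstract does not say how the random forest's learned features are extracted; the sketch assumes that the index of the leaf each document reaches in each tree serves as the learned "term combination" feature. The toy corpus, labels, hyperparameters, and the helper name `predict` are all hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import OneHotEncoder

# Toy corpus (hypothetical data, for illustration only).
docs = [
    "cheap pills buy now", "win money now", "buy cheap watches",
    "limited offer win prize", "meeting agenda for monday",
    "project report attached", "lunch meeting tomorrow",
    "please review the report",
]
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = spam, 0 = not spam

# Original term features: binary word-occurrence indicators.
vectorizer = CountVectorizer(binary=True)
X_terms = vectorizer.fit_transform(docs)

# Layer 1: a random forest trained on the term features.
# ASSUMPTION: the leaf reached in each tree is taken as the learned
# high-level feature, since a root-to-leaf path tests several terms.
forest = RandomForestClassifier(n_estimators=20, max_depth=3, random_state=0)
forest.fit(X_terms, labels)
leaves = forest.apply(X_terms)  # shape: (n_docs, n_trees), one leaf id per tree

# Layer 2: one-hot encode the leaf ids, then fit Bernoulli naive Bayes.
encoder = OneHotEncoder(handle_unknown="ignore")
X_leaves = encoder.fit_transform(leaves)
nb = BernoulliNB(alpha=1.0)
nb.fit(X_leaves, labels)


def predict(texts):
    """Run new texts through both layers and return predicted labels."""
    X = vectorizer.transform(texts)
    return nb.predict(encoder.transform(forest.apply(X)))


print(predict(["buy cheap pills", "review the project report"]))
```

Each one-hot column answers a yes/no question ("did this document land in leaf k of tree t?"), which is why a Bernoulli model is a natural fit for the second layer under this assumption.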
