A Feature Selection Method for Small Samples

许行; 张凯; 王文剑

doi:10.7544/issn1000-1239.2018.20170748

许行¹, 张凯¹, 王文剑¹,²

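¹(School of Computer and Information Technology, Shanxi University, Taiyuan 030006)
²(Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 030006) (xuh102@126.com)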

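Funding: National Natural Science Foundation of China (61673249); Research Project Supported by Shanxi Scholarship Council of China (2016-004); CERNET Next Generation Internet Technology Innovation Project (NGII20170601)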

CLC number: TP181
Publication date: 2018-09-30

Abstract: Common machine learning algorithms often perform poorly on small-sample data, because the feature dimension is large relative to the number of samples and the data often contain irrelevant or redundant features. Reducing the dimension through feature selection is an effective way to address this problem. This paper proposes a filter feature selection method based on mutual information for small-sample data. First, a feature grouping criterion based on mutual information is defined; it accounts for both the relevance of each feature to the class and the redundancy among different features. The features are grouped according to this criterion, and the features with maximal relevance to the class in each group are chosen to form a candidate feature subset, which keeps the time complexity of the algorithm low. The Boruta algorithm is then applied to the candidate feature subset to determine the optimal feature subset automatically, which greatly reduces the dimension of the data. In comparisons with five classical feature selection algorithms on benchmark data sets using three classifiers, the feature subset selected by the proposed method shows better running efficiency and classification performance.
Keywords: small samples; feature selection; mutual information; feature grouping; filter algorithm
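
The two stages described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not state the exact grouping criterion, so the redundancy test used here (a feature joins a group when it shares at least as much information with the group's representative as with the class) is an assumption. The second stage is a simplified shadow-feature screen in the spirit of Boruta: it substitutes mutual information for the random-forest importances and the statistical test that Boruta actually uses. The names `mutual_information`, `group_and_select` and `shadow_screen` are hypothetical, and features are assumed to be discrete.

```python
import random
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information, in bits, of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def group_and_select(columns, labels, min_relevance=1e-12):
    """Stage 1 (assumed criterion): group redundant features, keep one each.

    Returns (groups, candidates); both hold indices into `columns`.
    """
    relevance = [mutual_information(col, labels) for col in columns]
    # Visit features from most to least relevant to the class.
    order = sorted(range(len(columns)), key=lambda j: -relevance[j])
    groups = []  # the first index in each group is its representative
    for j in order:
        if relevance[j] <= min_relevance:
            continue  # no measurable class information: drop the feature
        for group in groups:
            rep = group[0]
            # Assumed redundancy test, not taken from the paper.
            if mutual_information(columns[j], columns[rep]) >= relevance[j]:
                group.append(j)
                break
        else:
            groups.append([j])  # j is not redundant with any representative
    return groups, [group[0] for group in groups]


def shadow_screen(columns, labels, candidates, rounds=50, seed=0):
    """Stage 2 (simplified): keep candidates that beat shuffled copies."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(rounds):
        best_shadow = 0.0
        for j in candidates:
            shadow = list(columns[j])
            rng.shuffle(shadow)  # shuffling breaks any link to the labels
            best_shadow = max(best_shadow, mutual_information(shadow, labels))
        for j in candidates:
            if mutual_information(columns[j], labels) > best_shadow:
                hits[j] += 1
    # Keep a candidate only if it beat the best shadow in most rounds.
    return [j for j in candidates if hits[j] > rounds / 2]


if __name__ == "__main__":
    labels = [0, 1, 2, 3, 0, 1, 2, 3]
    columns = [
        [0, 0, 1, 1, 0, 0, 1, 1],  # relevant
        [0, 1, 0, 1, 0, 1, 0, 1],  # relevant, not redundant with column 0
        [1, 1, 0, 0, 1, 1, 0, 0],  # complement of column 0, hence redundant
        [0, 0, 0, 0, 1, 1, 1, 1],  # independent of the labels
    ]
    groups, candidates = group_and_select(columns, labels)
    print(groups)      # [[0, 2], [1]]
    print(candidates)  # [0, 1]
```

In this toy data, column 2 lands in the same group as column 0 and column 3 is discarded, so only one representative per group reaches the second stage. Running the screen on the small candidate set rather than on all features is what keeps the overall cost low.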
