ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2017, Vol. 54 ›› Issue (10): 2255-2267.doi: 10.7544/issn1000-1239.2017.20170456

• 信息安全 • 上一篇    下一篇



  1. (西安电子科技大学网络与信息安全学院 西安 710071) (
  • 出版日期: 2017-10-01
  • 基金资助: 

An Intrusion Detection Scheme Based on Semi-Supervised Learning and Information Gain Ratio

Xu Mengfan, Li Xinghua, Liu Hai, Zhong Cheng, Ma Jianfeng   

  1. (School of Cyber Engineering, Xidian Universality, Xi’an 710071)
  • Online: 2017-10-01

摘要: 针对现有未知攻击检测方法仅定性选取特征而导致检测精度较低的问题,提出一种基于半监督学习和信息增益率的入侵检测方案.利用目标网络在遭受攻击时反应在底层重要网络流量特征各异的特点,在模型训练阶段,为了克服训练数据集规模有限的问题,采用半监督学习算法利用少量标记数据获得大规模的训练数据集;在模型检测阶段,引入信息增益率定量分析不同特征对检测性能的影响程度,最大程度地保留了特征信息,以提高模型对未知攻击的检测性能.实验结果表明:该方案能够利用少量标记数据定量分析目标网络中未知攻击的重要网络流量特征并进行检测,其针对不同目标网络中未知攻击检测的准确率均达到90%以上.

关键词: 入侵检测, 未知攻击, 特征选取, 半监督学习, 信息增益率

Abstract: State-of-the-art intrusion detection schemes for unknown attacks employ machine learning techniques to identify anomaly features within network traffic data. However, due to the lack of enough training set, the difficulty of selecting features quantitatively and the dynamic change of unknown attacks, the existing schemes cannot detect unknown attacks effectually. To address this issue, an intrusion detection scheme based on semi-supervised learning and information gain ratio is proposed. In order to overcome the limited problem of training set in the training period, the semi-supervised learning algorithm is used to obtain large-scale training set with a small amount of labelled data. In the detection period, the information gain ratio is introduced to determine the impact of different features and weight voting to infer the final output label to identify unknown attacks adaptively and quantitatively, which can not only retain the information of features at utmost, but also adjust the weight of single decision tree adaptively against dynamic attacks. Extensive experiments indicate that the proposed scheme can quantitatively analyze the important network traffic features of unknown attacks and detect them by using a small amount of labelled data with no less than 91% accuracy and no more than 5% false negative rate, which have obvious advantages over existing schemes.

Key words: intrusion detection, unknown attacks, feature selection, semi-supervised learning, information gain ratio