ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (2): 433-446. doi: 10.7544/issn1000-1239.2020.20190160

• Artificial Intelligence •

A DGA Domain Name Detection Method Based on Deep Learning Models with Mixed Word Embedding

Du Peng and Ding Shifei

  1. (School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116) (Engineering Research Center of Mine Digitization (China University of Mining and Technology), Ministry of Education, Xuzhou, Jiangsu 221116) (pengdu@cumt.edu.cn)
  • Online: 2020-02-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61672522, 61976216, 61379101), the Graduate Innovation Fund of Jiangsu Province (KYCX19_2196), and the Postgraduate Research & Practice Innovation Program of China University of Mining and Technology (KYCX19_2196).

Abstract: Detecting domain names produced by domain generation algorithms (DGAs) plays a key role in preventing botnet attacks. It has practical value for generating threat intelligence, blocking botnet command and control traffic, and maintaining network security. In recent years, DGA domain name detection has progressed from methods that rely on manually crafted features to deep learning methods that extract features automatically, and multiple studies indicate that deep learning methods perform better on this task. However, in the multi-class classification of DGA domain names by botnet family, the families are numerous and the amount of domain name data per family is imbalanced, so many existing deep learning models still leave room for improvement. To address these challenges, a mixed word embedding that combines character-level and bigram-level embeddings is designed to make fuller use of the information in a domain name string, together with a deep learning model built on this embedding. Experiments with multiple comparison models are then conducted to verify the effectiveness of the mixed word embedding. The results show that the model based on the mixed word embedding outperforms models based on character-level embedding alone in both DGA domain name detection and multi-class classification, especially on DGA families with few samples, which supports the effectiveness of the proposed approach.

Key words: domain generation algorithm (DGA), mixed word embedding, deep learning, convolutional neural network (CNN), long short-term memory (LSTM)
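
The abstract describes feeding a model with both character-level and bigram-level views of a domain name. The following is a minimal sketch of that input preparation only, not the authors' implementation: the function names (`char_tokens`, `bigram_tokens`, `build_vocab`, `encode`), the reserved indices for padding and unknown tokens, and the fixed sequence length are all assumptions made for illustration. How the two embeddings are combined inside the network is not specified in the abstract and is not shown here.

```python
from typing import Dict, List

PAD_ID = 0  # assumed index reserved for padding
UNK_ID = 1  # assumed index reserved for tokens not seen when building the vocabulary


def char_tokens(domain: str) -> List[str]:
    """Split a domain name string into single characters."""
    return list(domain.lower())


def bigram_tokens(domain: str) -> List[str]:
    """Split a domain name string into overlapping two-character tokens."""
    s = domain.lower()
    return [s[i:i + 2] for i in range(len(s) - 1)]


def build_vocab(token_lists: List[List[str]]) -> Dict[str, int]:
    """Assign an integer index to every distinct token, starting after the reserved ids."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for tok in tokens:
            if tok not in vocab:
                vocab[tok] = len(vocab) + 2
    return vocab


def encode(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to indices, truncate to max_len, and right-pad with PAD_ID."""
    ids = [vocab.get(tok, UNK_ID) for tok in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))


if __name__ == "__main__":
    # Hypothetical example strings; real training data would come from labeled DGA families.
    domains = ["example", "xjwqkzpt"]
    char_vocab = build_vocab([char_tokens(d) for d in domains])
    bigram_vocab = build_vocab([bigram_tokens(d) for d in domains])
    for d in domains:
        char_ids = encode(char_tokens(d), char_vocab, max_len=10)
        bigram_ids = encode(bigram_tokens(d), bigram_vocab, max_len=10)
        print(d, char_ids, bigram_ids)
```

Each domain name yields two index sequences, one per granularity. In a deep learning model, each sequence would typically pass through its own embedding layer before the two representations are merged; the merging strategy is a design choice left to the full paper.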
