ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2020, Vol. 57, Issue (2): 433-446. doi: 10.7544/issn1000-1239.2020.20190160

A DGA Domain Name Detection Method Based on Deep Learning Models with Mixed Word Embedding

Du Peng and Ding Shifei   

  1. (School of Computer Science and Technology, China University of Mining and Technology, Xuzhou, Jiangsu 221116) (Engineering Research Center of Mine Digitization (China University of Mining and Technology), Ministry of Education, Xuzhou, Jiangsu 221116)
  • Online: 2020-02-01
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61672522, 61976216, 61379101), the Graduate Innovation Fund of Jiangsu Province (KYCX19_2196), and the Postgraduate Research & Practice Innovation Program of China University of Mining and Technology (KYCX19_2196).

Abstract: Detecting domain names produced by domain generation algorithms (DGAs) plays a key role in preventing botnet attacks. It has practical value for generating threat intelligence, blocking botnet command-and-control traffic, and maintaining cyber security. In recent years, DGA domain name detection has progressed from methods that rely on manually crafted features to deep learning methods that extract features automatically, and multiple studies indicate that deep learning methods perform better at DGA detection. However, DGA families are numerous and varied, and the domain name data is imbalanced across families in multi-class classification, so many existing deep learning models leave room for improvement. To address these problems, this paper designs a mixed word embedding method that combines character-level and bigram-level embeddings to make fuller use of the information in a domain name, and builds a deep learning model on top of it. Experiments against multiple comparison models show that the model based on mixed word embedding outperforms models based on character-level embedding in both DGA domain name detection and multi-class classification, especially for small DGA families with few samples. These results indicate that the proposed approach is effective.

Key words: domain generation algorithm (DGA), mixed word embedding, deep learning, convolutional neural network (CNN), long short-term memory (LSTM)
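
The mixed word embedding described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the character-level and bigram-level embeddings are combined, so everything below is an assumption made for illustration, not the authors' implementation. That covers per-position concatenation, zero padding for the final character, the random stand-in vectors, and the names `EmbeddingTable` and `mixed_embedding`. In a real model the vectors would be learned and the result fed to a CNN or LSTM.

```python
import random


def char_tokens(domain):
    """Split a domain name into single characters."""
    return list(domain)


def bigram_tokens(domain):
    """Split a domain name into overlapping character bigrams."""
    return [domain[i:i + 2] for i in range(len(domain) - 1)]


class EmbeddingTable:
    """Toy lookup table (hypothetical helper): each new token gets a
    fixed random vector. A trained model would learn these vectors."""

    def __init__(self, dim, seed):
        self.dim = dim
        self.rng = random.Random(seed)
        self.vectors = {}

    def lookup(self, token):
        if token not in self.vectors:
            self.vectors[token] = [self.rng.uniform(-1.0, 1.0)
                                   for _ in range(self.dim)]
        return self.vectors[token]


def mixed_embedding(domain, char_table, bigram_table):
    """Assumed mixing scheme: for each position, concatenate the character
    vector with the vector of the bigram that starts at that position.
    The last character starts no bigram, so its bigram part is zeros."""
    chars = char_tokens(domain)
    bigrams = bigram_tokens(domain)
    rows = []
    for i, ch in enumerate(chars):
        char_vec = char_table.lookup(ch)
        if i < len(bigrams):
            bigram_vec = bigram_table.lookup(bigrams[i])
        else:
            bigram_vec = [0.0] * bigram_table.dim
        rows.append(char_vec + bigram_vec)
    return rows


# Example: a 6-character name yields 6 rows of width 4 + 4 = 8.
rows = mixed_embedding("google", EmbeddingTable(4, seed=0), EmbeddingTable(4, seed=1))
print(len(rows), len(rows[0]))
```

Concatenation keeps both views of the name aligned position by position, which is one simple way to give a downstream sequence model access to bigram context without discarding character identity.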
