Abstract:
Detecting domain names produced by domain generation algorithms (DGAs) plays a key role in preventing botnet attacks. It is of practical importance for generating threat intelligence, blocking botnet command-and-control traffic, and maintaining cyber security. In recent years, DGA domain name detection has progressed from methods based on manually crafted features to deep learning methods that extract features automatically, and multiple studies have indicated that deep learning methods perform better at DGA detection. However, there are many different DGA families, and the domain name data is imbalanced across families in multi-class classification, so many existing deep learning models can still be improved. To address these problems, this paper designs a mixed word embedding method, based on character-level embedding and bigram-level embedding, that makes fuller use of the information contained in domain names. The paper also designs a deep learning model that uses the mixed word embedding method. Finally, the model is evaluated experimentally against multiple comparison models. The experimental results show that, compared with models based on character-level embedding, the model based on the mixed word embedding achieves better performance in both DGA domain name detection and multi-class classification of DGA families, especially for small families with few samples. These results indicate that the proposed approach is effective.
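
To make the idea of a mixed embedding concrete, the following is a minimal Python sketch of one possible way to combine character-level and bigram-level embeddings. It is not the paper's implementation: the per-position concatenation, the padding symbol `PAD`, the embedding dimensions, and the names `EmbeddingTable`, `char_tokens`, `bigram_tokens`, and `mixed_embedding` are all assumptions made for illustration, and the random vectors merely stand in for embeddings that a real model would learn during training.

```python
import random

PAD = "$"  # hypothetical padding symbol so the last character still has a bigram


class EmbeddingTable:
    """Maps each token to a fixed random vector, created on first lookup.

    Stand-in for a trainable embedding layer (assumption for illustration).
    """

    def __init__(self, dim, seed):
        self.dim = dim
        self._rng = random.Random(seed)
        self._vectors = {}

    def lookup(self, token):
        if token not in self._vectors:
            self._vectors[token] = [
                self._rng.uniform(-1.0, 1.0) for _ in range(self.dim)
            ]
        return self._vectors[token]


def char_tokens(domain):
    """Split a domain name into lowercase characters."""
    return list(domain.lower())


def bigram_tokens(domain):
    """Return one bigram per character position, padding the final one."""
    s = domain.lower()
    padded = s + PAD
    return [padded[i:i + 2] for i in range(len(s))]


def mixed_embedding(domain, char_table, bigram_table):
    """Concatenate the character and bigram vectors at each position."""
    chars = char_tokens(domain)
    bigrams = bigram_tokens(domain)
    return [
        char_table.lookup(c) + bigram_table.lookup(b)
        for c, b in zip(chars, bigrams)
    ]


if __name__ == "__main__":
    char_table = EmbeddingTable(dim=4, seed=0)
    bigram_table = EmbeddingTable(dim=8, seed=1)
    print(bigram_tokens("abc"))  # ['ab', 'bc', 'c$']
    matrix = mixed_embedding("abc", char_table, bigram_table)
    print(len(matrix), len(matrix[0]))  # 3 12
```

In this sketch each position in the domain name carries both a character vector and the vector of the bigram that starts there, so a downstream sequence model would see local two-character context in addition to the individual characters. Other combination schemes, such as running separate character and bigram branches and merging them later, are equally plausible readings of "mixed".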