ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2017, Vol. 54, Issue (9): 1945-1957. doi: 10.7544/issn1000-1239.2017.20160554


Parallel of Decision Tree Classification Algorithm for Stream Data

Ji Yimu (1,2,3,4), Zhang Yongpan (1), Lang Xianbo (1), Zhang Dianchao (1), Wang Ruchuan (1,2)

  1. School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023
  2. Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks (Nanjing University of Posts and Telecommunications), Nanjing 210023
  3. Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023
  4. Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information (Nanjing University of Science and Technology), Ministry of Education, Nanjing 210094

Online: 2017-09-01

Abstract: With the rise of cloud computing, the Internet of Things, and related technologies, stream data has become a widespread new form of big data in telecommunications, the Internet, finance, and other fields. Compared with traditional static data, stream data arrives rapidly and continuously and changes over time. In addition, the implicit (underlying) distribution of a data stream can change, which gives rise to the concept drift problem. To meet the requirements of stream data classification in big data settings, traditional static offline classification algorithms must be improved; we therefore propose P-HT, a parallel algorithm built on the distributed computing platform Storm. To suit the Storm stream processing platform, we improve the flexibility and versatility of the algorithm through a sliding window mechanism, an alternative tree mechanism, and a parallel processing mechanism, so that the algorithm adapts well to concept drift in the data stream. Finally, we experimentally verify the validity and efficiency of the algorithm. The results show that the improved P-HT algorithm achieves higher throughput and faster processing speed than the traditional C4.5 algorithm without reducing accuracy.
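
The abstract names a sliding window mechanism and an alternative tree mechanism but does not give their details. As a rough illustration only, and not the authors' P-HT algorithm, the sketch below shows one common way such mechanisms can interact: accuracy is tracked over a fixed-size window of recent items, a drop below a threshold starts an alternate model, and the alternate replaces the main model if it does better over a full window. Everything here is an assumption made for illustration: the class names `MajorityClassModel` and `WindowedSwapClassifier`, the parameters `window_size` and `drift_threshold`, and the use of a majority-class learner as a stand-in for a decision tree. The Storm-based parallel processing is not covered.

```python
from collections import Counter, deque


class MajorityClassModel:
    """Hypothetical stand-in learner (not a decision tree): predicts the most frequent label."""

    def __init__(self):
        self.counts = Counter()

    def learn(self, features, label):
        # Features are ignored by this stand-in; a real tree would split on them.
        self.counts[label] += 1

    def predict(self, features):
        # Returns None until at least one label has been seen.
        return self.counts.most_common(1)[0][0] if self.counts else None


class WindowedSwapClassifier:
    """Tracks windowed accuracy and swaps in an alternate model after suspected drift."""

    def __init__(self, window_size=20, drift_threshold=0.5):
        self.window_size = window_size          # assumed parameter
        self.drift_threshold = drift_threshold  # assumed parameter
        self.main = MajorityClassModel()
        self.alternate = None
        self.main_hits = deque(maxlen=window_size)  # sliding window of correct/incorrect flags
        self.alt_hits = deque(maxlen=window_size)
        self.swaps = 0

    def process(self, features, label):
        """Test-then-train on one labeled item; returns the main model's prediction."""
        prediction = self.main.predict(features)
        self.main_hits.append(prediction == label)
        if self.alternate is not None:
            self.alt_hits.append(self.alternate.predict(features) == label)
            self.alternate.learn(features, label)
        self.main.learn(features, label)
        self._check_drift()
        return prediction

    def _check_drift(self):
        if len(self.main_hits) < self.window_size:
            return  # not enough evidence yet
        main_acc = sum(self.main_hits) / self.window_size
        if self.alternate is None:
            if main_acc < self.drift_threshold:
                # Suspected drift: start training an alternate model from scratch.
                self.alternate = MajorityClassModel()
                self.alt_hits.clear()
        elif len(self.alt_hits) == self.window_size:
            alt_acc = sum(self.alt_hits) / self.window_size
            if alt_acc > main_acc:
                # The alternate wins over a full window: promote it.
                self.main, self.main_hits = self.alternate, self.alt_hits
                self.swaps += 1
            self.alternate = None
            self.alt_hits = deque(maxlen=self.window_size)


if __name__ == "__main__":
    clf = WindowedSwapClassifier()
    # Synthetic stream whose label distribution changes abruptly halfway through.
    for label in ["a"] * 100 + ["b"] * 100:
        clf.process(None, label)
    print("swaps:", clf.swaps, "current prediction:", clf.main.predict(None))
```

Because the windows are bounded deques, memory use stays constant regardless of stream length, which is the usual motivation for a sliding window on unbounded data.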

Key words: stream data, classification algorithms, Storm platform, sliding windows, C4.5 algorithm, parallel algorithm
