Ji Yimu, Zhang Yongpan, Lang Xianbo, Zhang Dianchao, Wang Ruchuan. Parallel of Decision Tree Classification Algorithm for Stream Data[J]. Journal of Computer Research and Development, 2017, 54(9): 1945-1957. DOI: 10.7544/issn1000-1239.2017.20160554
1(School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023)
2(Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks (Nanjing University of Posts and Telecommunications), Nanjing 210023)
3(Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023)
4(Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information (Nanjing University of Science and Technology), Ministry of Education, Nanjing 210094)
With the rise of cloud computing, the Internet of Things and related technologies, stream data has become widespread in telecommunications, the Internet, finance and other fields as a new form of big data. Compared with traditional static data, stream data arrives rapidly and continuously and changes over time. In addition, changes in the implicit distribution of a data stream give rise to the concept drift problem. To meet the requirements of stream data classification in big data, we improve on traditional static, offline classification algorithms and propose P-HT, a parallel algorithm built on the distributed computing platform Storm. To suit the Storm stream processing platform, we improve the flexibility and versatility of the algorithm through a sliding window mechanism, an alternative tree mechanism and a parallel processing mechanism, so that the algorithm adapts well to concept drift in the data stream. Finally, we verify the validity and efficiency of the algorithm experimentally. The results show that the improved P-HT algorithm achieves higher throughput and faster processing than the traditional C4.5 algorithm without any loss of accuracy.
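The interplay of a sliding window and an alternative model under concept drift can be illustrated with a minimal sketch. This is not the P-HT implementation: the class names `WindowedAccuracy` and `AlternateSwapper`, the parameters `window_size` and `margin`, and the use of plain prediction functions in place of decision trees are all assumptions made for illustration. The sketch tracks the accuracy of a main and an alternative classifier over the most recent labelled examples and promotes the alternative once it is clearly better.

```python
from collections import deque


class WindowedAccuracy:
    """Accuracy over the most recent `size` predictions (a sliding window)."""

    def __init__(self, size):
        self.hits = deque(maxlen=size)  # oldest entries fall out automatically

    def record(self, correct):
        self.hits.append(1 if correct else 0)

    def full(self):
        return len(self.hits) == self.hits.maxlen

    def value(self):
        return sum(self.hits) / len(self.hits) if self.hits else 0.0


class AlternateSwapper:
    """Hypothetical helper: replaces the main model with the alternative
    when the alternative is more accurate on the current window."""

    def __init__(self, main, alternate, window_size=50, margin=0.1):
        self.main = main                # callable: x -> predicted label
        self.alternate = alternate      # callable: x -> predicted label
        self.main_acc = WindowedAccuracy(window_size)
        self.alt_acc = WindowedAccuracy(window_size)
        self.margin = margin            # assumed swap threshold
        self.swaps = 0

    def process(self, x, y):
        """Predict with the main model, then update both windows with label y."""
        prediction = self.main(x)
        self.main_acc.record(prediction == y)
        self.alt_acc.record(self.alternate(x) == y)
        gap = self.alt_acc.value() - self.main_acc.value()
        if self.main_acc.full() and gap > self.margin:
            # Promote the alternative; the windows travel with their models.
            self.main, self.alternate = self.alternate, self.main
            self.main_acc, self.alt_acc = self.alt_acc, self.main_acc
            self.swaps += 1
        return prediction
```

In a Storm deployment, logic of this kind would presumably run inside the processing components of the topology; the sketch deliberately leaves out the parallel processing mechanism and the incremental training of the trees.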