ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2017, Vol. 54 ›› Issue (9): 1945-1957. doi: 10.7544/issn1000-1239.2017.20160554

• Artificial Intelligence •

Parallel of Decision Tree Classification Algorithm for Stream Data

Ji Yimu1,2,3,4, Zhang Yongpan1, Lang Xianbo1, Zhang Dianchao1, Wang Ruchuan1,2

  1(School of Computer Science and Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023); 2(Jiangsu High Technology Research Key Laboratory for Wireless Sensor Networks (Nanjing University of Posts and Telecommunications), Nanjing 210023); 3(Institute of Advanced Technology, Nanjing University of Posts and Telecommunications, Nanjing 210023); 4(Key Laboratory of Intelligent Perception and Systems for High-Dimensional Information (Nanjing University of Science and Technology), Ministry of Education, Nanjing 210094) (jiym@njupt.edu.cn)
  • Published: 2017-09-01
  • Online: 2017-09-01
  • Supported by:
    the National Natural Science Foundation of China (61170065); the Natural Science Foundation of Jiangsu Province for Excellent Young Scholars (BK20170100); the National Key Research and Development Program of China (2017YFB0202200); the Key Research and Development Program of Jiangsu Province (BE2017166)

Abstract: With the rise of cloud computing, the Internet of Things, and related technologies, stream data has become a new form of big data that is widespread in telecommunications, the Internet, finance, and other fields. Compared with traditional static data, stream data in big-data environments arrives rapidly and continuously and changes over time. In addition, changes in the underlying distribution of a data stream give rise to the concept drift problem. Stream data classification in big-data environments therefore requires that traditional static, offline classification algorithms be improved, and we propose P-HT, a parallel algorithm built on the distributed computing platform Storm. While meeting the requirements of the Storm stream processing platform, P-HT uses a sliding window mechanism, an alternative subtree mechanism, and parallel processing to improve the flexibility and generality of the algorithm and to adapt well to concept drift in the data stream. Finally, experiments verify the effectiveness and efficiency of the algorithm: the results show that, with no loss of accuracy relative to the traditional C4.5 algorithm, the improved P-HT algorithm achieves higher throughput and faster processing.

Key words: stream data, classification algorithm, Storm platform, sliding window, C4.5 algorithm, parallel algorithm
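
The abstract names a sliding window mechanism and an alternative subtree mechanism but gives no implementation details. The following is only a minimal sketch of how the two ideas might fit together, not the P-HT algorithm itself: the names `WindowedErrorRate` and `should_swap`, the window size, and the `margin` threshold are all hypothetical and chosen for illustration.

```python
from collections import deque


class WindowedErrorRate:
    """Error rate over the most recent `size` predictions (a sliding window)."""

    def __init__(self, size):
        # A bounded deque drops the oldest outcome automatically.
        self.outcomes = deque(maxlen=size)  # True means a misclassification

    def record(self, predicted, actual):
        self.outcomes.append(predicted != actual)

    def full(self):
        return len(self.outcomes) == self.outcomes.maxlen

    def rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0


def should_swap(current, alternate, margin=0.05):
    """Hypothetical rule: swap once both windows are full and the alternate
    subtree's windowed error is lower than the current one's by `margin`."""
    return (current.full() and alternate.full()
            and alternate.rate() + margin < current.rate())


# Usage: after a drift, the stale subtree keeps predicting class 0
# while the alternate subtree follows the new concept (class 1).
current, alternate = WindowedErrorRate(4), WindowedErrorRate(4)
for actual in [1, 1, 1, 1]:
    current.record(0, actual)
    alternate.record(1, actual)
print(should_swap(current, alternate))
```

Because the window only holds recent outcomes, errors made before the drift stop influencing the comparison once they fall out of it, which is the property that lets a windowed statistic react to a changing distribution.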
