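• China's Excellent Science and Technology Journal
  • CCF-recommended Class A Chinese-language journal
  • T1-class high-quality science and technology journal in the computing field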

Parallel of Decision Tree Classification Algorithm for Stream Data

Ji Yimu, Zhang Yongpan, Lang Xianbo, Zhang Dianchao, Wang Ruchuan

Citation: Ji Yimu, Zhang Yongpan, Lang Xianbo, Zhang Dianchao, Wang Ruchuan. Parallel of Decision Tree Classification Algorithm for Stream Data[J]. Journal of Computer Research and Development, 2017, 54(9): 1945-1957. DOI: 10.7544/issn1000-1239.2017.20160554


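Funding: National Natural Science Foundation of China (61170065); Natural Science Foundation of Jiangsu Province, Excellent Young Scholars Program (BK20170100); National Key Research and Development Program of China (2017YFB0202200); Key Research and Development Program of Jiangsu Province (BE2017166)
Details
  • Chinese Library Classification (CLC) number: TP391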

Parallel of Decision Tree Classification Algorithm for Stream Data

  • Abstract: With the rise of cloud computing, the Internet of Things, and related technologies, stream data has become a widespread new form of big data in telecommunications, the Internet, finance, and other fields. Compared with traditional static data, stream data in big data environments is fast, continuous, and time-varying. In addition, changes in the underlying distribution of a data stream give rise to concept drift. To meet the requirements of stream data classification in big data environments, traditional static, offline classification algorithms must be adapted; to this end, we propose P-HT, a parallel algorithm built on the distributed computing platform Storm. While satisfying the requirements of the Storm stream processing platform, P-HT uses a sliding window mechanism, an alternate subtree mechanism, and parallel processing to improve the flexibility and generality of the algorithm and to adapt well to concept drift in the data stream. Finally, experiments verify the effectiveness and efficiency of the algorithm: with no loss of accuracy relative to the traditional C4.5 algorithm, the improved P-HT algorithm achieves higher throughput and faster processing.
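The abstract names a sliding window mechanism and an alternate subtree mechanism as the means of handling concept drift. The following is a minimal single-process sketch of that general idea, not the authors' P-HT algorithm and not Storm code. Every name in it (`WindowedAccuracy`, `MajorityClassifier`, `DriftAwareModel`, `window`, `margin`) is hypothetical, and a trivial majority-class learner stands in for a decision subtree. It assumes that an alternate model is trained alongside the main one and promoted when its accuracy over the most recent window is clearly higher.

```python
from collections import deque


class WindowedAccuracy:
    """Tracks prediction accuracy over the most recent `size` examples."""

    def __init__(self, size):
        self.hits = deque(maxlen=size)  # old entries fall out automatically

    def record(self, correct):
        self.hits.append(1 if correct else 0)

    def full(self):
        return len(self.hits) == self.hits.maxlen

    def accuracy(self):
        return sum(self.hits) / len(self.hits) if self.hits else 0.0


class MajorityClassifier:
    """Stand-in learner (assumption): predicts the most frequent label seen."""

    def __init__(self):
        self.counts = {}

    def predict(self, x):
        if not self.counts:
            return None
        return max(self.counts, key=self.counts.get)

    def learn(self, x, y):
        self.counts[y] = self.counts.get(y, 0) + 1


class DriftAwareModel:
    """Main model plus a periodically restarted alternate model."""

    def __init__(self, window=50, margin=0.1):
        self.window = window  # sliding window length (hypothetical value)
        self.margin = margin  # required accuracy gain before promoting
        self.main = MajorityClassifier()
        self.alt = MajorityClassifier()
        self.main_acc = WindowedAccuracy(window)
        self.alt_acc = WindowedAccuracy(window)
        self.swaps = 0

    def process(self, x, y):
        """Test-then-train on one labeled example; returns main's prediction."""
        pred = self.main.predict(x)
        self.main_acc.record(pred == y)
        self.alt_acc.record(self.alt.predict(x) == y)
        self.main.learn(x, y)
        self.alt.learn(x, y)
        if self.alt_acc.full():
            if self.alt_acc.accuracy() > self.main_acc.accuracy() + self.margin:
                # The alternate fits recent data better: promote it.
                self.main, self.main_acc = self.alt, self.alt_acc
                self.swaps += 1
            # Start a fresh alternate that sees only data from now on.
            self.alt = MajorityClassifier()
            self.alt_acc = WindowedAccuracy(self.window)
        return pred
```

On a stream whose label changes abruptly, the main model keeps predicting the old label until an alternate trained only on recent examples is promoted. In a distributed setting, the per-example work would be spread across parallel workers, which this sketch does not attempt to show.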
Metrics
  • Article views:  1687
  • HTML full-text views:  4
  • PDF downloads:  977
  • Citations: 0
Publication history
  • Published:  2017-08-31
