ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2022, Vol. 59 ›› Issue (3): 647-660.doi: 10.7544/issn1000-1239.20200648

• 人工智能 • 上一篇    下一篇

DMFUCP:大规模轨迹数据通用伴随模式分布式挖掘框架

张敬伟1,刘绍建1,杨青2,周娅1   

  1. 1(广西可信软件重点实验室(桂林电子科技大学) 广西桂林 541004);2(广西自动检测技术与仪器重点实验室(桂林电子科技大学) 广西桂林 541004) (gtzjw@hotmail.com)
  • 出版日期: 2022-03-07
  • 基金资助: 
    国家自然科学基金项目(61862013,61662015,U1811264,U1711263);广西自然科学基金项目(2020GXNSFAA159117,2018GXNSFAA281199,2017GXNSFAA198035);广西可信软件重点实验室重点课题(KX202052);广西自动检测技术与仪器重点实验室主任基金项目(YQ19109)

DMFUCP: A Distributed Mining Framework for Universal Companion Patterns on Large-Scale Trajectory Data

Zhang Jingwei1, Liu Shaojian1, Yang Qing2, Zhou Ya1   

  1. 1(Guangxi Key Laboratory of Trusted Software (Guilin University of Electronic Technology), Guilin, Guangxi 541004);2(Guangxi Key Laboratory of Automatic Detecting Technology and Instrument (Guilin University of Electronic Technology), Guilin, Guangxi 541004)
  • Online: 2022-03-07
  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61862013, 61662015, U1811264, U1711263), the Natural Science Foundation of Guangxi Aotonomous Region of China (2020GXNSFAA159117, 2018GXNSFAA281199, 2017GXNSFAA198035), the Key Project of Guangxi Key Laboratory of Trusted Software (KX202052), and the Foundation of Guangxi Key Laboratory of Automatic Detection Technology and Instrument (YQ19109).

摘要: 广泛应用的移动定位设备方便了用户位置数据的获取,轨迹数据量高速增长.通用伴随模式挖掘聚焦时空维度上的用户高相似度行为路径发现问题,基于大规模轨迹数据设计高效准确地伴随模式挖掘方法对发现用户偏好、构建新商业模式等具有重要意义,同时也极具挑战.一方面,海量且不断增长的轨迹数据要求伴随模式挖掘应具有良好的可扩展性,集中性挖掘策略并不适用.另一方面,现有的分布式挖掘框架在为高效模式挖掘提供高质量数据输入、轨迹数据中大量松散连接的有效处理等方面考虑不足,使得通用伴随模式发现存在改进空间.提出了一个分布式的2阶段通用伴随模式挖掘框架——DMFUCP,其通过嵌入数据预处理优化、松散连接分析优化等,让伴随模式挖掘方法呈现了更好的性能.其中,该框架为数据预处理阶段设了融合运动方向的密度聚类算法DBSCANCD和聚类平衡算法TCB,确保后续挖掘任务获得提供少噪音、高质量的轨迹数据输入;在模式挖掘阶段,该框架设计了G剪枝重划分算法GSPR和分段枚举算法SAE,GSPR使用参数G对长轨迹进行分割,并将分割后的所有分段重划分以改善松散连接的处理效果,SAE负责引入多线程和前向闭包保证挖掘算法的性能.实验证明,相比现有的通用伴随模式挖掘框架,DMFUCP具有更好的通用伴随模式发现能力的同时,将挖掘每组通用伴随模式的时间消耗降低了20%~40%.

关键词: 分布式挖掘框架, 松散连接, 聚类平衡, G剪枝重划分, 分段枚举

Abstract: The popularity of mobile positioning terminals makes users’ locations be easily accessible, which contributes huge amount of trajectory data. Universal companion pattern mining aims at discovering those highly overlapping behavior paths between moving objects in spatio-temporal dimensions, and it is very valuable and challenging to provide effective and efficient pattern mining methods on large-scale trajectories. Obviously, the mining strategy on a centralized environment is incompetent for the consideration of scalability caused by huge and growing trajectory data. Existing distributed mining frameworks are weak in both providing effective input for efficient pattern mining and the processing ability on a large number of loose connections in massive trajectories, which should be covered to improve mining performance. In this study, we propose a distributed two-stage mining framework, DMFUCP, which embeds optimization on data preprocessing and loose connection analysis to provide more efficient and effective universal companion pattern mining. In the data preprocessing stage of DMFUCP, we design both a density clustering algorithm DBSCANCD and a clustering balance algorithm TCB to input high-quality trajectory data with less noisy for mining tasks. In the mining stage of DMFUCP, we propose both a G pruning repartition algorithm GSPR and a segmented enumeration algorithm SAE. GSPR introduces a parameter G to segment long trajectories and then repartitions all segments to improve the processing effectiveness on loose connections. SAE guarantees the mining performance through multithreading and forward closure. Compared with those existing companion pattern mining frameworks on real datasets, DMFUCP reduces the time required to mine each set of universal companion pattern by 20% to 40% while providing better universal companion pattern discovery capabilities.

Key words: distributed mining framework, loose connections, clustering balance, G pruning repartition, segmented enumeration

中图分类号: