A Sampling Algorithm Based on Frequent Edges in Single Large-Scale Graph Under Spark
-
摘要: 随着社交网络的流行,对其进行频繁子图挖掘的需求越来越强烈.大数据时代的到来,社交网络规模不断扩大,频繁子图挖掘工作变得愈发困难.在实际应用中,往往并不需要精确地挖掘出频繁子图,采样的方法在保证一定准确率的前提下能够显著提高频繁子图挖掘的效率.现有采样算法大多是根据节点的度进行采样,不适用于频繁子图挖掘.提出了一种基于频繁边的采样算法DIMSARI(distributed Monte Carlo sampling algorithm based on random jump and graph induction),在蒙特卡罗算法的基础上增加了根据频繁边进行随机跳的操作,并对其结果进行了图感应操作,进一步增加了算法的准确性,并在理论上证明了该方法的无偏性.实验结果显示:使用DIMSARI算法采样后进行频繁子图挖掘,准确性比现有其他的采样算法有较大的提高,在不同的采样率下采样后的子图的节点度都保持更小的归一化均方偏差.Abstract: With the popularity of social networks, the demand for its frequent subgraph mining becomes more intense. With the arrival of the era of big data, social networks have been expanding and frequent subgraph mining becomes increasingly difficult. In fact, it does not require to mine frequent subgraphs exactly in application, so sampling methods are adopted to improve the efficiency of mining frequent subgraphs under certain accuracy. Most existing sampling algorithms are not fit for frequent subgraph mining because they use vertex transfer or compute the topology of the original graph first which will take a lot of time. In this paper, we propose a new sampling algorithm named DIMSARI (distributed Monte Carlo sampling algorithm based on random jump and graph induction) based on frequent edge, and it runs on a distributed framework named Spark. This algorithm is created on the basis of the Monte Carlo algorithm meanwhile adding random jump. The results are added by subgraph induction step to promote the accuracy of the algorithm and prove that the algorithm is unbiased. The experiments show that the accuracy of frequent subgraph mining using DIMSARI algorithm has been greatly improved and at the same time the proposed algorithm only spends a little more time than other algorithms. The apex of sampling at different sampling rates after subgraphs has maintained a lower normalized mean square error.
-
Keywords:
- sample /
- frequent subgraph /
- single large-scale graph /
- frequent edge /
- Spark
-
-
期刊类型引用(11)
1. 周显春,喻佳. 基于图神经网络的人工自然语言语义挖掘仿真. 计算机仿真. 2024(01): 344-348 . 百度学术
2. 孟祥福,田友发,张霄雁. 基于LightGBM模型的肺腺癌免疫相关基因筛选与患者生存率预测. 生物医学工程学杂志. 2024(01): 70-79 . 百度学术
3. 陈伟,周丽华,王亚峰,王丽珍,陈红梅. 异质信息网络中基于解耦图神经网络的社区搜索. 计算机科学. 2024(03): 90-101 . 百度学术
4. 万齐智,万常选,胡蓉,刘德喜,刘喜平,廖国琼. 面向研究问题的深度学习事件抽取综述. 自动化学报. 2024(11): 2079-2101 . 百度学术
5. 刘超,孔兵,杜国王,周丽华,陈红梅,包崇明. 高阶互信息最大化与伪标签指导的深度聚类. 浙江大学学报(工学版). 2023(02): 299-309 . 百度学术
6. 杨成波,周丽华,黄亚群,杨宇迪. 异质网络中基于关键词属性的Truss社区搜索. 计算机应用研究. 2023(06): 1708-1714 . 百度学术
7. 白明昌. 基于折叠路径聚合的属性网络节点嵌入方法. 计算机工程. 2023(07): 76-84 . 百度学术
8. 谢小杰,梁英,王梓森,刘政君. 基于图卷积的异质网络节点分类方法. 计算机研究与发展. 2022(07): 1470-1485 . 本站查看
9. 王宏琳,杨丹,聂铁铮,寇月. 自注意力机制的属性异构信息网络嵌入的商品推荐. 计算机研究与发展. 2022(07): 1509-1521 . 本站查看
10. 盛妍,朱龙珠,丁毛毛,刘鲲鹏,刘海龙. 面向电力服务情绪识别的图卷积网络方法研究. 电子器件. 2022(04): 959-963 . 百度学术
11. 李琳,梁永全,刘广明. 基于重启随机游走的图自编码器. 计算机应用研究. 2021(10): 3009-3013 . 百度学术
其他类型引用(20)
计量
- 文章访问数: 1267
- HTML全文浏览量: 2
- PDF下载量: 843
- 被引次数: 31