Abstract:
With the popularity of social networks, the demand for its frequent subgraph mining becomes more intense. With the arrival of the era of big data, social networks have been expanding and frequent subgraph mining becomes increasingly difficult. In fact, it does not require to mine frequent subgraphs exactly in application, so sampling methods are adopted to improve the efficiency of mining frequent subgraphs under certain accuracy. Most existing sampling algorithms are not fit for frequent subgraph mining because they use vertex transfer or compute the topology of the original graph first which will take a lot of time. In this paper, we propose a new sampling algorithm named DIMSARI (distributed Monte Carlo sampling algorithm based on random jump and graph induction) based on frequent edge, and it runs on a distributed framework named Spark. This algorithm is created on the basis of the Monte Carlo algorithm meanwhile adding random jump. The results are added by subgraph induction step to promote the accuracy of the algorithm and prove that the algorithm is unbiased. The experiments show that the accuracy of frequent subgraph mining using DIMSARI algorithm has been greatly improved and at the same time the proposed algorithm only spends a little more time than other algorithms. The apex of sampling at different sampling rates after subgraphs has maintained a lower normalized mean square error.