Abstract:
Complex mining algorithms over large-scale graphs usually require highly frequent iterative analysis, while distributed computing with scalable computing power and storage capacity is a preferred solution for efficiency. However, the edges freely connecting vertices generate a lot of messages across distributed computing tasks and hence incurs communication costs during iterations, which heavily limits the performance improvement of distributed computing. To alleviate the negative impact of communication bottleneck, existing research basically employs combining and replicating techniques on top of the traditional pushing framework design. However, they mainly focus on easy-to-be-optimized single-dimensional-message algorithms with simple message data structure, and are not suitable for other important multi-dimensional-message algorithms with complex structure. Also, they cannot be seamlessly integrated into the-state-of-the-art pulling framework where messages are generated on demand. We thereby propose a lightweight vertex replication mechanism to synchronize replicated vertices and generate messages based on such replications on demand. The mechanism can work well under the pulling framework with inherent advantages in terms of fault-tolerance and memory consumption, and also greatly optimize the communication costs. Moreover, by considering the communication benefits and the costs incurred by possible workload imbalance, it can select an optimal replication threshold for best performance. Finally, extensive experiments over various real-world graphs validate the effectiveness of the lightweight vertex replication framework and the threshold analysis model.