ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2016, Vol. 53 ›› Issue (8): 1792-1805.doi: 10.7544/issn1000-1239.2016.20160140

Special Issue: 2016数据挖掘前沿技术专题

Previous Articles     Next Articles

HSSM: A Hierarchical Method for Streaming Submodular Maximization

Zhang Fenxiang, Chen Huahui, Qian Jiangbo,Dong Yihong   

  1. (Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, Zhejiang 315211)
  • Online:2016-08-01

Abstract: How to extract k elements from a large data stream according to some utility functions can be reduced to maximizing a submodular set function. The traditional algorithms had already given some good solutions of summarizing a static data set by submodular method, well-known as standard greedy algorithm. Lastly researches also presented some new algorithms with corresponding distributed solutions to meet the streaming data access and the real-time response limits, but those algorithms are not good enough in maximizing the utility gain. In this paper, we propose a new algorithm called HSSM which runs as a pipelining distributed framework and requires only a single-pass access the data set. Finally, the utility gain of our solution is close to the option standard greedy solution. We also propose using utility vector to compress the set of summarization and filtering out the low gain objects to improve the original HSSM algorithm. Fortunately, this approach can get applications in many related fields such as representative articles selection, k-medoid select problem and so on. Experimental results show that the improved Spark-HSSM+ method can increase the summarization speed in direct proportion to k\+2 in contrast to the traditional method. Compared with other distributed algorithms, the result of the Spark-HSSM+ method is the most close to the standard greedy solution.

Key words: streaming submodular maximization, hierarchy model, pipelining parallelism, data summarization, Spark distribution platform

CLC Number: