ISSN 1000-1239 CN 11-1777/TP

计算机研究与发展 ›› 2016, Vol. 53 ›› Issue (8): 1792-1805.doi: 10.7544/issn1000-1239.2016.20160140

所属专题: 2016数据挖掘前沿技术专题

• 人工智能 • 上一篇    下一篇

HSSM:一种流数据分层次模最大化方法

张奋翔,陈华辉,钱江波,董一鸿   

  1. (宁波大学信息科学与工程学院 浙江宁波 315211) (zhang_fenxiang@163.com)
  • 出版日期: 2016-08-01
  • 基金资助: 
    国家自然科学基金项目(61572266,61472194)

HSSM: A Hierarchical Method for Streaming Submodular Maximization

Zhang Fenxiang, Chen Huahui, Qian Jiangbo,Dong Yihong   

  1. (Faculty of Electrical Engineering and Computer Science, Ningbo University, Ningbo, Zhejiang 315211)
  • Online: 2016-08-01

摘要: 从大规模数据中“摘要”出最能满足效用函数收益的有限个数据对象,可以被归纳为次模函数最大化问题.并行过滤算法在满足流数据访问次数限制与实时响应的条件下,通过分布式筛选的方式实现次规模最大化,但在提升摘要速率时效用函数收益损失较大.提出一种流数据分层次模最大化算法HSSM,在仅访问一次数据集的条件下,采用流水并行的分布式处理框架得到接近于标准贪心算法的次模函数收益,同时改进HSSM通过累积摘要的压缩存储、分层过滤低增益对象提升摘要速率.该方法在数据摘要问题的相关领域具有广泛的应用性,如文档集中代表性文章的选取、数据集中心点选取等.实验结果显示,分布式算法Spark-HSSM+对比于传统的算法在运行速率上达到与摘要规模k成k\+2正比例关系的提升.而相对于其他分布式算法,其实验效用收益与理论最差收益都更接近于贪心算法.

关键词: 流次模最大化, 分层模型, 流水并行, 数据摘要, Spark分布式平台

Abstract: How to extract k elements from a large data stream according to some utility functions can be reduced to maximizing a submodular set function. The traditional algorithms had already given some good solutions of summarizing a static data set by submodular method, well-known as standard greedy algorithm. Lastly researches also presented some new algorithms with corresponding distributed solutions to meet the streaming data access and the real-time response limits, but those algorithms are not good enough in maximizing the utility gain. In this paper, we propose a new algorithm called HSSM which runs as a pipelining distributed framework and requires only a single-pass access the data set. Finally, the utility gain of our solution is close to the option standard greedy solution. We also propose using utility vector to compress the set of summarization and filtering out the low gain objects to improve the original HSSM algorithm. Fortunately, this approach can get applications in many related fields such as representative articles selection, k-medoid select problem and so on. Experimental results show that the improved Spark-HSSM+ method can increase the summarization speed in direct proportion to k\+2 in contrast to the traditional method. Compared with other distributed algorithms, the result of the Spark-HSSM+ method is the most close to the standard greedy solution.

Key words: streaming submodular maximization, hierarchy model, pipelining parallelism, data summarization, Spark distribution platform

中图分类号: