    Yang Bei, Huang Houkuan. Research on an Algorithm for Approximate Quantile Computation over Data Streams[J]. Journal of Computer Research and Development, 2008, 45(2): 287-292.

    Research on an Algorithm for Approximate Quantile Computation over Data Streams


      Abstract: The data stream is a recent data model used in many applications, including traffic monitoring, telephone record management, sensor networks, stock-market analysis, and Web click streams. Quantile estimation over data streams has drawn growing research attention in recent years. Because streaming data is continuous, unbounded, and volatile, it is infeasible to store an entire stream, and exact answers to queries over it are therefore hard to obtain. This paper designs a synopsis data structure, Nord_Histogram, that stores a summary of the stream and balances memory cost against query accuracy, and presents a one-pass, online approximate algorithm for quantile computation built on it. The algorithm's time and space requirements are linear in the number of buckets in the synopsis and independent of the length of the data stream, so it scales well. The method performs particularly well on uniformly distributed data. The relationship between computation accuracy and main-memory requirement is also analyzed. Experimental results show that the algorithm yields small relative errors and works well over data streams.
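
      The abstract does not give the internals of Nord_Histogram, so the following is only a minimal sketch of the general idea it describes: a fixed number of buckets updated in one pass, with quantile queries that scan the buckets rather than the stream. It swaps in a plain equi-width histogram over a value range assumed to be known in advance, and interpolates linearly inside a bucket, which is accurate only when values are roughly uniform within it. The class name `EquiWidthHistogram` and its parameters are hypothetical and do not come from the paper.

```python
from typing import List


class EquiWidthHistogram:
    """Fixed-size bucket synopsis over an assumed known value range [lo, hi].

    Illustrative only: this is a generic equi-width histogram, not the
    paper's Nord_Histogram, whose details the abstract does not give.
    """

    def __init__(self, lo: float, hi: float, num_buckets: int) -> None:
        if hi <= lo or num_buckets <= 0:
            raise ValueError("need hi > lo and num_buckets > 0")
        self.lo = lo
        self.hi = hi
        self.width = (hi - lo) / num_buckets
        self.counts: List[int] = [0] * num_buckets
        self.total = 0

    def insert(self, x: float) -> None:
        """Process one stream element in O(1); the element itself is not stored."""
        idx = int((x - self.lo) / self.width)
        # Clamp out-of-range values into the first or last bucket.
        idx = min(max(idx, 0), len(self.counts) - 1)
        self.counts[idx] += 1
        self.total += 1

    def quantile(self, phi: float) -> float:
        """Approximate the phi-quantile by scanning the buckets once."""
        if self.total == 0:
            raise ValueError("no data seen yet")
        if not 0.0 <= phi <= 1.0:
            raise ValueError("phi must lie in [0, 1]")
        target = phi * self.total
        cumulative = 0
        for i, count in enumerate(self.counts):
            if count > 0 and cumulative + count >= target:
                # Assume values are spread uniformly inside the bucket.
                fraction = (target - cumulative) / count
                return self.lo + (i + fraction) * self.width
            cumulative += count
        return self.hi
```

      For example, after inserting the integers 0 to 999 into `EquiWidthHistogram(0.0, 1000.0, 10)`, `quantile(0.5)` returns 500.0. Memory use and query time depend only on the number of buckets, not on how many elements have been seen, which is the scalability property the abstract claims; the achievable accuracy, in turn, is bounded by the bucket width.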

       
