高级检索

    基于滑动窗口的Top-K概率频繁项查询算法研究

    Sliding Window Top-K Frequent Item Query on Uncertain Stream

    • 摘要: 频繁项查询在网络监控、网络入侵检测、关联规则挖掘等方面是一项非常重要的技术.该技术在静态的不确定数据中已经得到了深入的研究.但随着数据流特征和不确定性表现的日益明显,在不确定数据流环境下的查询已经成为一项新的研究课题.因此基于数据流普遍采用的滑动窗口模型,提出了一种高效的概率Top-K频繁项查询算法sTopK-UFI.该算法避免了每次窗口更新都重新计算查询答案,而是利用现有的计算结果进行增量更新,从而减少查询代价.另外,该算法基于窗口中的现有数据对未来可能成为频繁项的元素进行预测,并利用泊松分布计算元素成为频繁项的概率上下界,提出相应的过滤策略,可以显著减少检测数据的数量,提高查询效率.实验结果表明,所提出算法可以有效地减少候选集、降低搜索空间、改善在不确定数据流上的查询性能.

       

      Abstract: Frequent items detection has play an important role in many applications, such as network monitor, network intrusion detection, association rules mining, and so on. While is has been studied intensively in the field of static uncertain data, frequent items detection is still novel in the emerging uncertain data stream field. In this paper, based on the sliding window model, an efficient algorithm(sTopK-UFI) is proposed to detect frequent items on uncertain data stream. In order to reduce the computation cost, the algorithm incrementally updates the exact top-K results upon each window slide and thus eliminates the need for complete recomputation from scratch. On the other hand, according to the current data in the sliding window, items to be frequent in the future are predicated according to the Poisson distribution. By computing the upper and lower bound of the probability, the pruning rules are proposed, which can reduce the candidate set and improve the query efficiently. Extensive experiments on real and synthetic datasets show the sTopK-UFI algorithm is an effective way to solve the problem of frequent items detection on uncertain data stream; it could significantly reduce the candidate set, save the memory space, improve the execution time and meet the requirements of practical applications.

       

    /

    返回文章
    返回