Algorithm Based on Counting for Mining Frequent Items over Data Stream
-
Graphical Abstract
-
Abstract
Mining frequent items over data stream has drawn great attention, and large amount of efficient algorithms have been proposed by many researchers over the past decades. Although the classical algorithms are well suited to find frequent items, usually they do not perform well when estimating items’ approximate frequency. To solve this problem, we introduce a series of counter-based algorithms called SRoEC (segment rotative efficient count), SReEC (segment reserve efficient count) and RFreq (reserve frequent). They divide the counter used in classical algorithms and define operations for counters to improve the accuracy of item frequency and avoid the effect of low frequency items. As the experience shows, these algorithms can find Top-K items above the threshold correctly and return their approximate frequency as accurate as possible. Both analysis and experiments demonstrate that under same cost of space, these algorithms return higher count accuracy rate, lower frequency error rate and higher frequency reserve rate on both simulated data set and real data set when compared with the two best classical algorithms (frequent algorithm and space saving algorithm) nowadays. Amongst them, RFreq algorithm shows obvious advantages. What’s more, the algorithms perform much better than classical ones when the data distribution is smooth.
-
-