Abstract:
Sampling is an efficient and widely used approximation technique. By dramatically scaling down the data set, it allows many algorithms to be applied to huge data sets in data mining and stream mining. Based on a detailed review, we present a taxonomic framework of sampling algorithms organized around uniform sampling and biased sampling, and we analyze, compare, and evaluate representative sampling algorithms such as reservoir sampling, concise sampling, count sampling, chain-sampling, and DV sampling. Uniform sampling has limitations in some applications, including queries with relatively low selectivity, outlier detection in large multidimensional data sets, and clustering over data streams with a skewed Zipf distribution; we discuss in detail why biased sampling methods are needed in these scenarios. In addition to listing successful applications of sampling techniques in data mining, statistical estimation, and stream mining to date, we survey the application and development of sampling techniques, especially traditional, classic techniques such as progressive sampling, adaptive sampling, stratified sampling, and two-phase sampling. Finally, future challenges and directions for data stream sampling are discussed.