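• Outstanding Science and Technology Journal of China
  • CCF-recommended Class A Chinese-language journal
  • T1-class high-quality science and technology journal in the computing field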
Hu Wenyu, Sun Zhihui, Wu Yingjie. Study of Sampling Methods on Data Mining and Stream Mining[J]. Journal of Computer Research and Development, 2011, 48(1): 45-54.

Study of Sampling Methods on Data Mining and Stream Mining

  • Published Date: January 14, 2011
  • Sampling is an efficient and widely used approximation technique. By drastically reducing the size of a dataset, it allows many algorithms to be applied to huge datasets in data mining and stream mining. Based on a detailed review, this paper presents a taxonomy of sampling algorithms organized around uniform sampling and biased sampling, and it analyzes, compares, and evaluates representative algorithms such as reservoir sampling, concise sampling, count sampling, chain sampling, and DV sampling. Uniform sampling has limitations in some applications, namely queries with relatively low selectivity, outlier detection in large multidimensional datasets, and clustering over data streams with skewed Zipf distributions; the need for biased sampling methods in these scenarios is discussed in full. In addition to listing successful applications of sampling techniques in data mining, statistical estimation, and stream mining to date, we survey the application and development of sampling techniques, especially classic ones such as progressive sampling, adaptive sampling, stratified sampling, and two-phase sampling. Finally, future challenges and directions for data stream sampling are discussed.
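
Reservoir sampling, one of the uniform sampling algorithms named in the abstract, can be illustrated with a short Python sketch. This is the textbook single-pass formulation, not code taken from the paper; the function name `reservoir_sample` and its parameters `stream`, `k`, and `rng` are assumptions chosen for illustration.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper: makes one pass and keeps only k items in memory,
    so the stream length does not need to be known in advance.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i + 1 is kept with probability k / (i + 1);
            # randint is inclusive on both ends.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: draw 10 items from a stream of 1000 integers.
sample = reservoir_sample(range(1000), 10, random.Random(0))
```

Every item seen so far has the same probability of being in the reservoir, which is the uniformity property the abstract contrasts with biased sampling. If the stream holds fewer than `k` items, the sketch simply returns all of them.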
