Abstract:
The rapid development of the Internet has produced a wide variety of Web applications and Web data, which have become a major source of data for research. A Web page contains many kinds of content, such as advertisements, navigation bars, related links, and main text. For a given study or application, however, not all of this content is needed; on the contrary, unrelated content reduces the effectiveness and efficiency of the research or application. With the growth of search engines, Web page cleaning has therefore become a prominent topic in information retrieval, and a survey of work on Web page denoising is needed to support further in-depth study. First, this paper briefly introduces the necessity of Web page cleaning and its related concepts. Second, it presents a classification hierarchy of Web page cleaning methods, comprising single-model based methods and multi-model based methods. Third, it summarizes Web page cleaning techniques and frameworks, including SST, Shingle, Pagelet, and DSE, among others. Fourth, it describes the experimental datasets and experimental methods used in work on these techniques. Finally, it discusses existing problems and future directions in the field of Web page cleaning.