Abstract:
The input to multi-document summarization is a document set that contains a large amount of noise. Most state-of-the-art summarization systems use a fixed threshold-based noise filter, with a manually selected threshold, to filter out low-frequency units. However, experimental observations show that the best threshold varies with the document set, the summarization algorithm, and the text representation. This means that a fixed threshold-based noise filter is not robust across summarization settings, which leads to unstable noise-filtering performance. Therefore, a supervised learning method for generating an automatic noise filter is proposed. Using labels extracted automatically from human-written summaries and a set of selected features that apply to different types of semantic units, a semantic unit classifier is trained to serve as the automatic noise filter. The filter can be applied to the semantic units generated by different text representation methods, and it automatically removes noisy semantic units at the preprocessing stage of a multi-document summarization system. Experiments show that the automatic noise filter generated by the supervised learning method is robust across document sets, summarization algorithms, and text representations, and that noise filtering improves both the speed and the summary quality of each summarization algorithm.
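As a rough illustration of the idea summarized above, the following is a minimal sketch and not the authors' implementation. It assumes that semantic units are plain words, that a unit is labeled positive when it appears in a human-written reference summary, and that three simple features (document frequency, term frequency, earliest position) are available. The feature set, the logistic-regression classifier, and all function names (`unit_features`, `label_units`, `train_logistic`, `predict_proba`, `noise_filter`) are hypothetical choices made for this example only.

```python
import math


def unit_features(unit, docs):
    """Hypothetical features for one semantic unit (here: a single word).

    `docs` is a list of non-empty token lists. The leading 1.0 is a bias term.
    """
    doc_freq = sum(1 for d in docs if unit in d) / len(docs)
    term_freq = sum(d.count(unit) for d in docs) / sum(len(d) for d in docs)
    # Earliness: 1.0 means the unit opens some document, lower means later.
    positions = [1.0 - d.index(unit) / len(d) for d in docs if unit in d]
    earliness = max(positions) if positions else 0.0
    return [1.0, doc_freq, term_freq, earliness]


def label_units(units, reference_summaries):
    """Assumed labeling rule: 1 if the unit occurs in any human-written summary."""
    reference_vocab = {tok for summary in reference_summaries for tok in summary}
    return [1 if u in reference_vocab else 0 for u in units]


def predict_proba(w, x):
    """Probability that a unit with feature vector x is worth keeping."""
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, epochs=500, lr=0.5):
    """Plain stochastic gradient ascent on the logistic log-likelihood."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            p = predict_proba(w, xi)
            w = [wj + lr * (yi - p) * xj for wj, xj in zip(w, xi)]
    return w


def noise_filter(units, docs, w, cutoff=0.5):
    """Keep only the units the trained classifier does not consider noise."""
    return [u for u in units if predict_proba(w, unit_features(u, docs)) >= cutoff]


if __name__ == "__main__":
    # Toy training data (invented for illustration only).
    docs = [["storm", "hit", "coast", "the"], ["the", "storm", "caused", "floods"]]
    references = [["storm", "caused", "floods"]]
    units = sorted({tok for d in docs for tok in d})

    X = [unit_features(u, docs) for u in units]
    y = label_units(units, references)
    w = train_logistic(X, y)
    print(noise_filter(units, docs, w))
```

In this sketch the hand-picked frequency threshold is replaced by a learned decision boundary over the features, so the same filter can in principle be retrained for other unit types by changing only `unit_features`.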