Supervised Learning of an Automatic Noisy Semantic Unit Filter for Multi-Document Summarization

Gong Shu (龚书), Qu Youli (瞿有利), and Tian Shengfeng (田盛丰)

Abstract

Multi-document summarization operates on document sets that contain noise. Most existing summarization systems remove low-frequency units with a fixed-threshold noise filter whose threshold is set manually. Experiments show, however, that summarization algorithms differ in their inherent tolerance to noise, and that the best threshold varies with the document set, the summarization algorithm, and the text representation. A manually set fixed threshold therefore neither generalizes well across summarization settings nor filters noise reliably. To address this, a supervised learning method for generating an automatic noise filter is proposed. Labels are obtained automatically from human-written summaries, multiple features are extracted for each semantic unit, and a semantic unit classifier is trained to serve as the automatic noise filter. The filter applies to the semantic units produced by different text representation methods, and it can automatically remove noisy semantic units from any document set at the preprocessing stage of different multi-document summarization systems. Experiments show that the automatic noise filter generated by this method generalizes across document sets, summarization algorithms, and text representations, and that its noise filtering improves, to varying degrees, both the speed of each summarization algorithm and the quality of the extracted summaries.
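The contrast between a fixed-threshold filter and a learned filter can be illustrated with a minimal Python sketch. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: semantic units are taken to be lowercase words, a unit is labeled as worth keeping when it occurs in a human-written summary, the two features (relative frequency and document coverage) are hypothetical, and a simple perceptron stands in for whatever classifier the paper actually trains.

```python
from collections import Counter


def unit_counts(documents):
    """Count each semantic unit (assumed here to be a lowercase word)."""
    return Counter(w for doc in documents for w in doc.lower().split())


def fixed_threshold_filter(counts, threshold):
    """Baseline: keep units whose frequency reaches a manually set threshold."""
    return {u for u, c in counts.items() if c >= threshold}


def label_units(counts, human_summaries):
    """Assumed labeling rule: 1 if the unit occurs in a human summary, else 0."""
    summary_units = {w for s in human_summaries for w in s.lower().split()}
    return {u: int(u in summary_units) for u in counts}


def features(unit, counts, documents):
    """Two hypothetical features: relative frequency and document coverage."""
    total = sum(counts.values())
    doc_freq = sum(unit in doc.lower().split() for doc in documents)
    return [counts[unit] / total, doc_freq / len(documents)]


def train_perceptron(samples, epochs=50, lr=1.0):
    """Train a perceptron (a stand-in classifier) on (features, label) pairs."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in samples:
            pred = int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
            if pred != y:
                step = lr * (y - pred)
                w = [wi + step * xi for wi, xi in zip(w, x)]
                b += step
    return w, b


def learned_filter(model, counts, documents):
    """Keep the units that the trained classifier scores as non-noise."""
    w, b = model
    kept = set()
    for unit in counts:
        x = features(unit, counts, documents)
        if sum(wi * xi for wi, xi in zip(w, x)) + b > 0:
            kept.add(unit)
    return kept
```

In this sketch the baseline needs a hand-picked `threshold` for every setting, whereas the learned filter only needs document sets paired with human summaries for training and can then be applied to unseen document sets without a manual threshold.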
