A Numerical Label Noise Filtering Algorithm for Regression Task

Jiang Gaoxia (姜高霞); Wang Wenjian (王文剑)

doi:10.7544/issn1000-1239.20220053

A Numerical Label Noise Filtering Algorithm for Regression Task

Abstract

Numerical label noise in regression tasks can mislead model training and weaken generalization ability. Noise filtering, a commonly used technique for handling label noise, lowers the noise level by removing mislabeled samples, but it cannot guarantee that the model generalizes better after filtering. Some filters focus so heavily on the noise level that they also remove many noise-free samples. Although an existing sample selection framework can balance the number of removed samples against the noise level, its form is too complicated to be understood intuitively or applied in practice. Building on learning theory for noise-free regression, this paper proposes a generalization error bound for data with numerical label noise, which identifies the key data factors that affect generalization ability: data size and noise level. On this basis, an interpretable noise filtering framework is proposed, whose goal is to reduce the noise level as much as possible at a small cost in removed samples. For noise estimation, the relationship between the noise and the key indicators (center and radius) of the covering interval is analyzed theoretically, and a relative noise estimator is constructed. Combining this estimator with the proposed framework yields the relative noise filtering (RNF) algorithm. The effectiveness of RNF is verified on benchmark datasets and on an age estimation dataset. Experimental results show that RNF adapts to various types of noise and significantly improves the generalization ability of the regression model. On the age estimation dataset, RNF detects a number of samples with noisy labels, effectively improving data quality and model prediction performance.
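The trade-off described in the abstract, removing few samples while lowering the noise level, can be illustrated with a minimal Python sketch. It is not the RNF algorithm, whose estimator and framework are not specified in the abstract. The function names, the nearest-neighbour centre and radius, and the `threshold` and `max_removed` parameters are all assumptions made for illustration.

```python
from statistics import median


def estimate_noise(xs, ys, k=5):
    """Score each sample by how far its label sits from its neighbours' labels.

    Hypothetical stand-in estimator (not the paper's): the centre is the median
    label of the k nearest neighbours in x, and the radius is the median
    absolute deviation of those labels around the centre.
    """
    scores = []
    for i, (x, y) in enumerate(zip(xs, ys)):
        # Indices of the k nearest other samples, by distance in x.
        neighbours = sorted(
            (j for j in range(len(xs)) if j != i),
            key=lambda j: abs(xs[j] - x),
        )[:k]
        labels = [ys[j] for j in neighbours]
        centre = median(labels)
        radius = median([abs(v - centre) for v in labels])
        # Small constant avoids division by zero when neighbours agree exactly.
        scores.append(abs(y - centre) / (radius + 1e-9))
    return scores


def filter_samples(xs, ys, k=5, threshold=3.0, max_removed=None):
    """Remove samples whose score exceeds the threshold, worst first.

    max_removed (assumed parameter) caps the removal cost so that clean
    samples are not discarded in bulk.
    """
    scores = estimate_noise(xs, ys, k)
    flagged = sorted(
        (i for i, s in enumerate(scores) if s > threshold),
        key=lambda i: scores[i],
        reverse=True,
    )
    if max_removed is not None:
        flagged = flagged[:max_removed]
    removed = set(flagged)
    kept = [(x, y) for i, (x, y) in enumerate(zip(xs, ys)) if i not in removed]
    return kept, sorted(removed)


# Toy data: y = 2x with one corrupted label at index 10.
xs = list(range(20))
ys = [2 * x for x in xs]
ys[10] = 100
kept, removed = filter_samples(xs, ys, k=4, threshold=5.0)
print(removed)  # [10]
```

In this toy run only the corrupted sample is removed and the other 19 are kept. Lowering `threshold` would begin to discard clean samples near the boundaries, which is the over-filtering behaviour the abstract warns about.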
