Abstract:
Numerical label noise in regression can misguide model training and weaken generalization. Noise filtering, a popular remedy, reduces the noise level by removing mislabeled samples, yet it rarely guarantees better generalization: some filters focus so heavily on the noise level that many noise-free samples are removed as well. Although existing sample selection frameworks can balance the number of removals against the noise level, they are too complicated to be understood intuitively and applied in practice. In this paper, a generalization error bound is derived for data with numerical label noise, building on learning theory for the noise-free regression task. The bound clarifies the key data factors, including data size and noise level, that affect generalization. On this basis, an interpretable noise filtering framework is proposed whose goal is to minimize the noise level at a low cost of sample removal. Meanwhile, the relationship between noise and the key indicators (center and radius) of the covering interval is theoretically analyzed for noise estimation, and a relative noise estimator is proposed. The relative noise filtering (RNF) algorithm is then designed by integrating the proposed framework with this estimator. The effectiveness of RNF is verified on benchmark datasets and an age estimation dataset. Experimental results show that RNF adapts to various types of noise and significantly improves the generalization ability of the regression model. On the age estimation dataset, RNF detects samples with label noise, effectively improving data quality and model prediction performance.
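The abstract describes the filtering framework only at a high level. As a purely illustrative sketch of its stated goal (reducing the noise level at a low cost of sample removal), the snippet below removes the samples with the largest estimated label noise under a fixed removal budget. The k-nearest-neighbor residual used as the noise score and the function name `filter_noisy_samples` are hypothetical placeholders, not the paper's covering-interval-based relative noise estimator or the RNF algorithm itself.

```python
# Illustrative sketch only: budgeted removal of the samples with the largest
# estimated label noise. The noise score here (residual against a k-NN label
# average) is a placeholder assumption, standing in for the paper's
# covering-interval-based relative noise estimator.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def filter_noisy_samples(X, y, removal_budget=0.1, k=10):
    """Return sorted indices of samples kept after removing the
    `removal_budget` fraction with the largest estimated label noise."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                    # idx[:, 0] is the sample itself
    neighbor_mean = y[idx[:, 1:]].mean(axis=1)   # local estimate of the label
    noise_score = np.abs(y - neighbor_mean)      # proxy for label noise magnitude
    n_remove = int(removal_budget * len(y))
    keep = np.argsort(noise_score)[: len(y) - n_remove]  # drop highest-noise samples
    return np.sort(keep)

# Usage (X_train, y_train assumed to be NumPy arrays):
# kept = filter_noisy_samples(X_train, y_train, removal_budget=0.1)
# X_clean, y_clean = X_train[kept], y_train[kept]   # fit any regressor on the cleaned set
```

The removal budget makes the trade-off explicit: it caps how many samples may be discarded while the score ranking targets the noisiest labels first, mirroring the balance between noise level and number of removals discussed in the abstract.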