An Adaptive Regression Feature Selection Method for Datasets with Outliers

郭亚庆; 王文剑; 苏美红

doi:10.7544/issn1000-1239.2019.20190313

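Abstract (translated from the Chinese): Irrelevant and redundant features in a dataset make learning tasks harder; feature selection addresses this problem effectively and thereby improves learning efficiency and learner performance. Most existing feature selection methods target classification, and few address regression; in particular, existing methods are sensitive to outliers when the dataset contains them. Some methods improve robustness by weighting the per-sample loss functions, but the weights are usually set in advance and stay fixed throughout feature selection and learner training, so these methods adapt poorly. To address these problems, this paper proposes a regression feature selection method for outliers, adaptive weight LASSO (AWLASSO). It first updates the sample errors from the current regression coefficients; then, through an adaptive regularization term, it assigns smaller weights to the loss functions of samples whose errors exceed the current threshold and larger weights to those whose errors fall below it; it then re-estimates the regression coefficients under the reweighted loss function, and iterates this process. AWLASSO uses the threshold to control whether a sample takes part in estimating the regression coefficients: only samples with small errors participate, so a better estimate is obtained once the iteration finishes. Moreover, the threshold is not fixed but keeps increasing (its initial value is small so that the initial coefficient estimate is reasonably accurate), so samples misjudged as outliers can re-enter the training set, and the training set is guaranteed to contain enough samples. Samples whose errors exceed the maximum threshold have a high learning cost; the algorithm identifies them as outliers and sets their loss-function weights to 0, which effectively reduces the influence of outliers. Experiments on synthetic and benchmark datasets show that, on datasets containing outliers, the proposed method is more robust and yields sparser models than classical methods.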

Abstract: Irrelevant and redundant features in data make learning tasks more difficult; feature selection addresses this problem effectively and improves learning efficiency and learner performance. Most existing feature selection approaches are designed for classification problems, and there are few studies on regression problems. Especially in the presence of outliers, existing methods do not perform well. Although some methods improve robustness by weighting the sample loss functions, the weights are set in advance and fixed throughout feature selection and learner training, which leads to poor adaptability. This paper proposes a regression feature selection method for datasets with outliers, named adaptive weight LASSO (AWLASSO). First, it updates the sample errors according to the current regression coefficients. Then the weights of the loss functions of all samples are set according to an adaptive regularization term: the loss functions of samples whose errors are larger than the current threshold receive smaller weights, and those of samples whose errors are smaller than the threshold receive larger weights. The regression coefficients are then re-estimated under the weighted loss function with the updated weights, and the process is iterated. AWLASSO uses the threshold to control whether a sample participates in regression coefficient estimation. Only samples with small errors participate in the estimation, so a better estimate of the regression coefficients may be obtained in the end. In addition, the error threshold of AWLASSO is not fixed but increasing (the initial threshold is small so that the initial regression coefficient estimate is accurate), so samples that are misjudged as outliers have a chance to re-enter the training set. AWLASSO regards samples whose errors are larger than the maximum threshold as outliers because their learning cost is high, and sets the weights of their loss functions to 0. Hence, the influence of outliers is reduced. Experimental results on artificial data and benchmark datasets demonstrate that, compared with classical methods, the proposed AWLASSO has better robustness and sparsity, especially for datasets with outliers.
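
The iteration described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the adaptive regularization term, the weight formula, or the threshold schedule, so the code below assumes (i) squared residuals as the sample error, (ii) weight 1 below the current threshold, weight threshold/error between the current and maximum thresholds, and weight 0 above the maximum threshold, (iii) a multiplicative threshold growth, and (iv) an initial fit with uniform weights. All function and parameter names (`weighted_lasso`, `sample_weights`, `awlasso_sketch`, `growth`, and so on) are hypothetical, and the weighted LASSO subproblem is solved with plain coordinate descent.

```python
import random


def soft_threshold(value, alpha):
    """Soft-thresholding operator used in LASSO coordinate descent."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def weighted_lasso(X, y, weights, alpha, n_sweeps=100):
    """Minimise 0.5 * sum_i w_i * (y_i - x_i . beta)^2 + alpha * ||beta||_1
    by cyclic coordinate descent (no intercept; data assumed centred)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X beta while beta is all zeros
    for _ in range(n_sweeps):
        for j in range(p):
            z = sum(weights[i] * X[i][j] ** 2 for i in range(n))
            if z == 0.0:
                continue  # feature j has no weighted support; beta_j stays 0
            rho = sum(weights[i] * X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n))
            new_bj = soft_threshold(rho, alpha) / z
            delta = new_bj - beta[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_bj
    return beta


def sample_weights(errors, threshold, max_threshold):
    """Assumed weighting rule (not taken from the paper)."""
    weights = []
    for e in errors:
        if e > max_threshold:
            weights.append(0.0)            # treated as an outlier
        elif e > threshold:
            weights.append(threshold / e)  # smaller weight, in (0, 1)
        else:
            weights.append(1.0)            # full weight
    return weights


def awlasso_sketch(X, y, alpha, init_threshold, max_threshold,
                   growth=1.5, n_rounds=15):
    """Alternate between reweighting samples and refitting a weighted LASSO."""
    n = len(X)
    weights = [1.0] * n  # assumption: start from an unweighted fit
    beta = weighted_lasso(X, y, weights, alpha)
    threshold = init_threshold
    for _ in range(n_rounds):
        errors = [(y[i] - sum(a * b for a, b in zip(X[i], beta))) ** 2
                  for i in range(n)]
        weights = sample_weights(errors, threshold, max_threshold)
        beta = weighted_lasso(X, y, weights, alpha)
        # Grow the threshold so down-weighted samples can regain full weight.
        threshold = min(threshold * growth, max_threshold)
    return beta, weights


def make_demo_data(n=60, n_outliers=6, seed=0):
    """Synthetic data: y = 2*x0 - 3*x1 + small noise; first rows are outliers."""
    rng = random.Random(seed)
    X = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(n)]
    y = [2.0 * row[0] - 3.0 * row[1] + rng.gauss(0.0, 0.05) for row in X]
    for i in range(n_outliers):
        y[i] += 30.0  # gross corruption of the response
    return X, y


X, y = make_demo_data()
beta, weights = awlasso_sketch(X, y, alpha=1.0, init_threshold=0.5,
                               max_threshold=25.0)
print([round(b, 2) for b in beta])  # estimated coefficients
print(weights[:6])                  # weights of the corrupted samples
```

In this toy setting the corrupted samples keep large errors after every refit, so their weights drop to 0 and the coefficients are estimated from the clean samples only; the three irrelevant features are shrunk toward zero by the L1 penalty. The threshold values and the penalty `alpha` are chosen by hand for this example and would need tuning on real data.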

An Adaptive Regression Feature Selection Method for Datasets with Outliers