Citation: Ma Lanjihong, Zhao Peng, Zhou Zhihua. Robust Heavy-Tailed Linear Bandits Algorithm[J]. Journal of Computer Research and Development, 2023, 60(6): 1385-1395. DOI: 10.7544/issn1000-1239.202220279
The linear bandits model is one of the most fundamental online learning models, in which a linear function parametrizes the mean payoff of each arm. The model covers a wide range of applications with strong theoretical guarantees and practical modeling power. However, existing algorithms suffer from data irregularities that frequently emerge in real-world applications, as the data are usually collected from open and dynamic environments. In this paper, we are particularly concerned with two kinds of data irregularities: the underlying regression parameter may change over time, and the noise may be unbounded or even not sub-Gaussian; these are referred to as model drift and heavy-tailed noise, respectively. To deal with these two hostile factors, we propose a novel algorithm based on the upper confidence bound (UCB) principle: a median-of-means estimator handles the potential heavy-tailed noise, and a restarting mechanism tackles the model drift. Theoretically, we establish a minimax lower bound to characterize the difficulty of the problem and prove that our algorithm enjoys a no-regret upper bound. The attained results subsume previous analyses for scenarios without either model drift or heavy-tailed noise. Empirically, we additionally design several online ensemble techniques to make our algorithm more adaptive to the environment. Extensive experiments on synthetic and real-world datasets validate its effectiveness.
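To make the two ingredients concrete, the following is a minimal Python sketch, not the paper's exact algorithm: a median-of-means estimate, which is robust to heavy-tailed samples, and a periodically restarted ridge-regression UCB loop, which discards stale data when the underlying regression parameter may have drifted. All names and parameter values here (e.g. restart_every, num_blocks, beta) are illustrative assumptions rather than the authors' settings.

```python
import numpy as np


def median_of_means(samples, num_blocks):
    """Split samples into blocks, average each block, return the median of the block means."""
    blocks = np.array_split(np.asarray(samples, dtype=float), num_blocks)
    return float(np.median([block.mean() for block in blocks]))


def restarted_linucb(contexts_fn, reward_fn, horizon, dim,
                     restart_every=200, reg=1.0, beta=1.0):
    """UCB over a ridge-regression estimate, restarted every `restart_every` rounds."""
    gram = reg * np.eye(dim)   # regularized Gram matrix
    moment = np.zeros(dim)     # accumulated reward-weighted contexts
    total_reward = 0.0
    for t in range(horizon):
        if t > 0 and t % restart_every == 0:
            # Restart: forget the past so the estimator can track model drift.
            gram = reg * np.eye(dim)
            moment = np.zeros(dim)
        theta_hat = np.linalg.solve(gram, moment)      # current parameter estimate
        arms = np.asarray(contexts_fn(t))              # candidate features, shape (K, dim)
        gram_inv = np.linalg.inv(gram)
        bonus = np.sqrt(np.einsum("ki,ij,kj->k", arms, gram_inv, arms))
        chosen = arms[int(np.argmax(arms @ theta_hat + beta * bonus))]
        reward = reward_fn(t, chosen)                  # possibly heavy-tailed payoff
        gram += np.outer(chosen, chosen)
        moment += reward * chosen
        total_reward += reward
    return total_reward
```

As a quick sanity check under these assumptions, one could feed synthetic contexts with Student-t reward noise and an abrupt change of the regression parameter, and compare the restarted loop against a never-restarted one; the restart lets the estimator recover after the change, while median_of_means illustrates how block averaging damps heavy-tailed outliers.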