Citation: Shu Wentao, Li Ruixiao, Sun Tianxiang, Huang Xuanjing, Qiu Xipeng. Large Language Models: Principles, Implementation, and Progress[J]. Journal of Computer Research and Development, 2024, 61(2): 351-361. DOI: 10.7544/issn1000-1239.202330303
In recent years, the emergence and rapid development of large language models (LLMs) have revolutionized natural language processing and, more broadly, artificial intelligence. As the number of model parameters and the amount of training data grow, the perplexity of a language model decreases in a predictable manner, which translates into better performance on a wide range of natural language processing tasks. Scaling up language models has therefore become a promising route to more intelligent systems. In this survey, we first review the definition and scope of LLMs and provide a scale standard for distinguishing “large” language models from the perspectives of performance and computing. We then review the development and representative work of LLMs along three dimensions: data, algorithms, and model architecture, showing how scaling up in each dimension has driven the development of LLMs at different stages. Next, we discuss the emergent abilities of LLMs and possible explanations for them, highlighting three key emergent abilities, i.e., chain-of-thought prompting, in-context learning, and instruction following, and introducing their related advances and applications. Finally, we outline potential directions and open challenges for LLMs.
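As a concrete illustration of this predictable scaling behavior, the compute-optimal scaling study of Hoffmann et al. [8] in the reference list below fits the test loss L (log-perplexity) as a function of the parameter count N and the number of training tokens D with a simple parametric form; a minimal sketch of that form, in which E, A, B, \alpha, \beta are empirically fitted constants rather than values reported in this survey, is

L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

Kaplan et al. [1] fit power laws of a similar shape, which is why lower perplexity, and hence improved downstream performance, can be forecast from model and data scale.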
[1] Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models[J]. arXiv preprint, arXiv:2001.08361, 2020
[2] Brown P F, Della Pietra V J, Desouza P V, et al. Class-based n-gram models of natural language[J]. Computational Linguistics, 1992, 18(4): 467−480
[3] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 30th Annual Conf on Neural Information Processing Systems. New York: Curran Associates, 2017: 5998−6008
[4] Lin Tianyang, Wang Yuxin, Liu Xiangyang, et al. A survey of Transformers[J]. AI Open, 2021(3): 111−132
[5] Wei J, Tay Y, Bommasani R, et al. Emergent abilities of large language models[J]. arXiv preprint, arXiv:2206.07682, 2022
[6] Rajbhandari S, Rasley J, Ruwase O, et al. ZeRO: Memory optimizations toward training trillion parameter models[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis (SC20). Piscataway, NJ: IEEE, 2020: 1−16
[7] Kingma D P, Ba J. Adam: A method for stochastic optimization[J]. arXiv preprint, arXiv:1412.6980, 2014
[8] Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models[J]. arXiv preprint, arXiv:2203.15556, 2022
[9] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[J]. Advances in Neural Information Processing Systems, 2020, 33: 1877−1901
[10] Mikolov T, Chen Kai, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv:1301.3781, 2013
[11] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[DB/OL]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018
[12] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint, arXiv:1810.04805, 2018
[13] Collobert R, Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning[C]//Proc of the 25th Int Conf on Machine Learning. New York: ACM, 2008: 160−167
[14] OpenAI. GPT-4 technical report[J]. arXiv preprint, arXiv:2303.08774, 2023
[15] Chung H W, Hou Le, Longpre S, et al. Scaling instruction-finetuned language models[J]. arXiv preprint, arXiv:2210.11416, 2022
[16] Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv:2302.13971, 2023
[17] Peters M, Neumann M, Iyyer M, et al. Deep contextualized word representations[C]//Proc of the 2018 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2018: 2227−2237
[18] Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint, arXiv:1406.1078, 2014
[19] Tang G, Müller M, Rios A, et al. Why self-attention? A targeted evaluation of neural machine translation architectures[J]. arXiv preprint, arXiv:1808.08946, 2018
[20] Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models[J]. arXiv preprint, arXiv:2203.15556, 2022
[21] Michaud E J, Liu Ziming, Girit U, et al. The quantization model of neural scaling[J]. arXiv preprint, arXiv:2303.13506, 2023
[22] Sun Tianxiang, Shao Yunfan, Qian Hong, et al. Black-box tuning for language-model-as-a-service[C]//Proc of the Int Conf on Machine Learning. New York: PMLR, 2022: 20841−20855
[23] Akyürek E, Schuurmans D, Andreas J, et al. What learning algorithm is in-context learning? Investigations with linear models[J]. arXiv preprint, arXiv:2211.15661, 2022
[24] Dai Damai, Sun Yutao, Dong Li, et al. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers[J]. arXiv preprint, arXiv:2212.10559, 2022
[25] Min S, Lyu X, Holtzman A, et al. Rethinking the role of demonstrations: What makes in-context learning work?[J]. arXiv preprint, arXiv:2202.12837, 2022
[26] Wei J, Wei J, Tay Y, et al. Larger language models do in-context learning differently[J]. arXiv preprint, arXiv:2303.03846, 2023
[27] Zhao Z, Wallace E, Feng Si, et al. Calibrate before use: Improving few-shot performance of language models[C]//Proc of the Int Conf on Machine Learning. New York: PMLR, 2021: 12697−12706
[28] Wei J, Wang Xuezhi, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[J]. Advances in Neural Information Processing Systems, 2022, 35: 24824−24837
[29] Kojima T, Gu S S, Reid M, et al. Large language models are zero-shot reasoners[J]. Advances in Neural Information Processing Systems, 2022, 35: 22199−22213
[30] Zhou D, Schärli N, Hou L, et al. Least-to-most prompting enables complex reasoning in large language models[J]. arXiv preprint, arXiv:2205.10625, 2022
[31] Wang Xuezhi, Wei J, Schuurmans D, et al. Self-consistency improves chain of thought reasoning in language models[J]. arXiv preprint, arXiv:2203.11171, 2022
[32] Zhang Zhuosheng, Zhang A, Li Mu, et al. Automatic chain of thought prompting in large language models[J]. arXiv preprint, arXiv:2210.03493, 2022
[33] Khashabi D, Kordi Y, Hajishirzi H. UnifiedQA-v2: Stronger generalization via broader cross-format training[J]. arXiv preprint, arXiv:2202.12359, 2022
[34] Scao T L, Fan A, Akiki C, et al. BLOOM: A 176B-parameter open-access multilingual language model[J]. arXiv preprint, arXiv:2211.05100, 2022
[35] Khashabi D, Min S, Khot T, et al. UnifiedQA: Crossing format boundaries with a single QA system[J]. arXiv preprint, arXiv:2005.00700, 2020
[36] Huang Shaohan, Dong Li, Wang Wenhui, et al. Language is not all you need: Aligning perception with language models[J]. arXiv preprint, arXiv:2302.14045, 2023
[37] Peng Zhiliang, Wang Wenhui, Dong Li, et al. Kosmos-2: Grounding multimodal large language models to the world[J]. arXiv preprint, arXiv:2306.14824, 2023
[38] Ouyang Long, Wu J, Jiang Xu, et al. Training language models to follow instructions with human feedback[J]. Advances in Neural Information Processing Systems, 2022, 35: 27730−27744
[39] Wang Yizhong, Kordi Y, Mishra S, et al. Self-instruct: Aligning language model with self generated instructions[J]. arXiv preprint, arXiv:2212.10560, 2022
[40] Bai Yuntao, Kadavath S, Kundu S, et al. Constitutional AI: Harmlessness from AI feedback[J]. arXiv preprint, arXiv:2212.08073, 2022
[41] Zheng Lianmin, Chiang W L, Sheng Ying, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena[J]. arXiv preprint, arXiv:2306.05685, 2023
[42] Wang Sinong, Li B Z, Khabsa M, et al. Linformer: Self-attention with linear complexity[J]. arXiv preprint, arXiv:2006.04768, 2020
[43] Dao T, Fu D, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness[J]. Advances in Neural Information Processing Systems, 2022, 35: 16344−16359
[44] Peng Bo, Alcaide E, Anthony Q, et al. RWKV: Reinventing RNNs for the Transformer era[J]. arXiv preprint, arXiv:2305.13048, 2023
[45] Schick T, Dwivedi-Yu J, Dessì R, et al. Toolformer: Language models can teach themselves to use tools[J]. arXiv preprint, arXiv:2302.04761, 2023
[46] Chen Yufei, Shen Chao, Wang Qian, et al. Security and privacy risks in artificial intelligence systems[J]. Journal of Computer Research and Development, 2019, 56(10): 2135−2150 (in Chinese). DOI: 10.7544/issn1000-1239.2019.20190415