Citation: Shu Wentao, Li Ruixiao, Sun Tianxiang, Huang Xuanjing, Qiu Xipeng. Large Language Models: Principles, Implementation, and Progress[J]. Journal of Computer Research and Development, 2024, 61(2): 351−361. DOI: 10.7544/issn1000-1239.202330303
In recent years, the emergence and development of large language models (LLMs) have revolutionized natural language processing and, more broadly, artificial intelligence. As the number of model parameters and the amount of training data grow, the perplexity of language models decreases in a predictable manner, which translates into improved performance on a wide range of natural language processing tasks. Scaling up language models has therefore become a promising way to improve system intelligence. In this survey, we first review the definition and scope of LLMs and propose a scale criterion, from the perspectives of performance and compute, for deciding when a language model counts as "large". We then review the development and representative work of LLMs along three dimensions: data, algorithms, and model architecture, showing how scaling up along these dimensions has driven the development of LLMs at different stages. Next, we discuss the emergent abilities of LLMs and possible explanations for them, highlighting three key emergent abilities, i.e., chain-of-thought prompting, in-context learning, and instruction following, and introducing related advances and applications. Finally, we outline potential directions and open challenges for LLMs.
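The "predictable manner" in which perplexity falls is usually summarized by empirical power-law scaling laws. As a rough sketch only (the functional form follows Kaplan et al. [1]; the constants N_c, D_c, C_c and the exponents are empirically fitted quantities, not values reported in this survey), test loss L can be modeled as a power law in the number of parameters N, the dataset size D, or the training compute C, when the other factors are not the bottleneck:

```latex
% Illustrative power-law form of neural scaling laws, after Kaplan et al. [1].
% N: model parameters, D: training tokens, C: training compute;
% N_c, D_c, C_c and \alpha_N, \alpha_D, \alpha_C are empirically fitted constants.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Hoffmann et al. [8] later fit a joint form L(N, D) = E + A·N^{-α} + B·D^{-β}, which leads to the compute-optimal prescription of growing parameters and training tokens roughly in proportion.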
[1] Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models[J]. arXiv preprint, arXiv: 2001.08361, 2020
[2] Brown P F, Della Pietra V J, deSouza P V, et al. Class-based n-gram models of natural language[J]. Computational Linguistics, 1992, 18(4): 467−480
[3] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 31st Annual Conf on Neural Information Processing Systems. New York: Curran Associates, 2017: 5998−6008
[4] Lin Tianyang, Wang Yuxin, Liu Xiangyang, et al. A survey of Transformers[J]. AI Open, 2022, 3: 111−132
[5] Wei J, Tay Y, Bommasani R, et al. Emergent abilities of large language models[J]. arXiv preprint, arXiv: 2206.07682, 2022
[6] Rajbhandari S, Rasley J, Ruwase O, et al. ZeRO: Memory optimizations toward training trillion parameter models[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis (SC20). Piscataway, NJ: IEEE, 2020: 1−16
[7] Kingma D P, Ba J. Adam: A method for stochastic optimization[J]. arXiv preprint, arXiv: 1412.6980, 2014
[8] Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models[J]. arXiv preprint, arXiv: 2203.15556, 2022
[9] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[J]. Advances in Neural Information Processing Systems, 2020, 33: 1877−1901
[10] Mikolov T, Chen Kai, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv: 1301.3781, 2013
[11] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[DB/OL]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018
[12] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint, arXiv: 1810.04805, 2018
[13] Collobert R, Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning[C]//Proc of the 25th Int Conf on Machine Learning. New York: ACM, 2008: 160−167
[14] OpenAI. GPT-4 technical report[J]. arXiv preprint, arXiv: 2303.08774, 2023
[15] Chung H W, Hou Le, Longpre S, et al. Scaling instruction-finetuned language models[J]. arXiv preprint, arXiv: 2210.11416, 2022
[16] Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023
[17] Peters M, Neumann M, Iyyer M, et al. Deep contextualized word representations[C]//Proc of the 2018 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2018: 2227−2237
[18] Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint, arXiv: 1406.1078, 2014
[19] Tang G, Müller M, Rios A, et al. Why self-attention? A targeted evaluation of neural machine translation architectures[J]. arXiv preprint, arXiv: 1808.08946, 2018
[20] Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models[J]. arXiv preprint, arXiv: 2203.15556, 2022
[21] Michaud E J, Liu Ziming, Girit U, et al. The quantization model of neural scaling[J]. arXiv preprint, arXiv: 2303.13506, 2023
[22] Sun Tianxiang, Shao Yunfan, Qian Hong, et al. Black-box tuning for language-model-as-a-service[C]//Proc of the 39th Int Conf on Machine Learning. New York: PMLR, 2022: 20841−20855
[23] Akyürek E, Schuurmans D, Andreas J, et al. What learning algorithm is in-context learning? Investigations with linear models[J]. arXiv preprint, arXiv: 2211.15661, 2022
[24] Dai Damai, Sun Yutao, Dong Li, et al. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers[J]. arXiv preprint, arXiv: 2212.10559, 2022
[25] Min S, Lyu X, Holtzman A, et al. Rethinking the role of demonstrations: What makes in-context learning work?[J]. arXiv preprint, arXiv: 2202.12837, 2022
[26] Wei J, Wei J, Tay Y, et al. Larger language models do in-context learning differently[J]. arXiv preprint, arXiv: 2303.03846, 2023
[27] Zhao Z, Wallace E, Feng Shi, et al. Calibrate before use: Improving few-shot performance of language models[C]//Proc of the 38th Int Conf on Machine Learning. New York: PMLR, 2021: 12697−12706
[28] Wei J, Wang Xuezhi, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[J]. Advances in Neural Information Processing Systems, 2022, 35: 24824−24837
[29] Kojima T, Gu S S, Reid M, et al. Large language models are zero-shot reasoners[J]. Advances in Neural Information Processing Systems, 2022, 35: 22199−22213
[30] Zhou D, Schärli N, Hou L, et al. Least-to-most prompting enables complex reasoning in large language models[J]. arXiv preprint, arXiv: 2205.10625, 2022
[31] Wang Xuezhi, Wei J, Schuurmans D, et al. Self-consistency improves chain of thought reasoning in language models[J]. arXiv preprint, arXiv: 2203.11171, 2022
[32] Zhang Zhuosheng, Zhang A, Li Mu, et al. Automatic chain of thought prompting in large language models[J]. arXiv preprint, arXiv: 2210.03493, 2022
[33] Khashabi D, Kordi Y, Hajishirzi H. UnifiedQA-v2: Stronger generalization via broader cross-format training[J]. arXiv preprint, arXiv: 2202.12359, 2022
[34] Le Scao T, Fan A, Akiki C, et al. BLOOM: A 176B-parameter open-access multilingual language model[J]. arXiv preprint, arXiv: 2211.05100, 2022
[35] Khashabi D, Min S, Khot T, et al. UnifiedQA: Crossing format boundaries with a single QA system[J]. arXiv preprint, arXiv: 2005.00700, 2020
[36] Huang Shaohan, Dong Li, Wang Wenhui, et al. Language is not all you need: Aligning perception with language models[J]. arXiv preprint, arXiv: 2302.14045, 2023
[37] Peng Zhiliang, Wang Wenhui, Dong Li, et al. Kosmos-2: Grounding multimodal large language models to the world[J]. arXiv preprint, arXiv: 2306.14824, 2023
[38] Ouyang Long, Wu J, Jiang Xu, et al. Training language models to follow instructions with human feedback[J]. Advances in Neural Information Processing Systems, 2022, 35: 27730−27744
[39] Wang Yizhong, Kordi Y, Mishra S, et al. Self-Instruct: Aligning language models with self-generated instructions[J]. arXiv preprint, arXiv: 2212.10560, 2022
[40] Bai Yuntao, Kadavath S, Kundu S, et al. Constitutional AI: Harmlessness from AI feedback[J]. arXiv preprint, arXiv: 2212.08073, 2022
[41] Zheng Lianmin, Chiang W L, Sheng Ying, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena[J]. arXiv preprint, arXiv: 2306.05685, 2023
[42] Wang Sinong, Li B Z, Khabsa M, et al. Linformer: Self-attention with linear complexity[J]. arXiv preprint, arXiv: 2006.04768, 2020
[43] Dao T, Fu D, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness[J]. Advances in Neural Information Processing Systems, 2022, 35: 16344−16359
[44] Peng Bo, Alcaide E, Anthony Q, et al. RWKV: Reinventing RNNs for the Transformer era[J]. arXiv preprint, arXiv: 2305.13048, 2023
[45] Schick T, Dwivedi-Yu J, Dessì R, et al. Toolformer: Language models can teach themselves to use tools[J]. arXiv preprint, arXiv: 2302.04761, 2023
[46] 陈宇飞,沈超,王骞,等. 人工智能系统安全与隐私风险[J]. 计算机研究与发展,2019,56(10):2135−2150. doi: 10.7544/issn1000-1239.2019.20190415
Chen Yufei, Shen Chao, Wang Qian, et al. Security and privacy risks in artificial intelligence systems[J]. Journal of Computer Research and Development, 2019, 56(10): 2135−2150 (in Chinese). doi: 10.7544/issn1000-1239.2019.20190415
Cited by:
1. 戎珂,施新伟,吕若明. “i7算”赋能AI产业生态可持续发展. 科学学研究,2025(01): 197−204
Rong Ke, Shi Xinwei, Lü Ruoming. “i7算” empowering the sustainable development of the AI industry ecosystem. Studies in Science of Science, 2025(01): 197−204 (in Chinese)
2. 张浩严,吕文涛,余润泽,邓志江. 大语言模型研究现状. 无线电工程,2025(01): 163−174
Zhang Haoyan, Lü Wentao, Yu Runze, et al. Research status of large language models. Radio Engineering, 2025(01): 163−174 (in Chinese)
3. 李东闻,钟震宇,孙羽菲,申峻宇,马子智,于川越,张玉志. 玲珑:一个小规模的高质量中文预训练语言模型. 计算机研究与发展,2025(03): 682−693
Li Dongwen, Zhong Zhenyu, Sun Yufei, et al. 玲珑 (Linglong): A small-scale high-quality Chinese pre-trained language model. Journal of Computer Research and Development, 2025(03): 682−693 (in Chinese)
4. 陶江垚,奚雪峰,盛胜利,崔志明,左严. 结构化思维提示增强大语言模型推理能力综述. 计算机工程与应用,2025(06): 64−83
Tao Jiangyao, Xi Xuefeng, Sheng Shengli, et al. A survey of structured thought prompting for enhancing the reasoning ability of large language models. Computer Engineering and Applications, 2025(06): 64−83 (in Chinese)
5. 魏楚元,王昕,周小平,赵光哲,黄明. 大型语言模型及其在建筑行业应用研究综述. 北京建筑大学学报,2024(02): 1−14+121
Wei Chuyuan, Wang Xin, Zhou Xiaoping, et al. A survey of large language models and their applications in the construction industry. Journal of Beijing University of Civil Engineering and Architecture, 2024(02): 1−14+121 (in Chinese)
6. 庞进喜. 大模型在汽车国际化多语言处理中的应用. 中国汽车,2024(05): 14−20
Pang Jinxi. Applications of large models in multilingual processing for automobile internationalization. China Auto, 2024(05): 14−20 (in Chinese)
7. 王晓璐,杨云轩,谢阳斌. 创造人机对话式学习新形态——大语言模型的教育应用现状与展望. 中小学信息技术教育,2024(05): 15−17
Wang Xiaolu, Yang Yunxuan, Xie Yangbin. Creating a new form of human-machine conversational learning: Status and prospects of educational applications of large language models. Information Technology Education in Primary and Secondary Schools, 2024(05): 15−17 (in Chinese)
8. 马伟民. 自然语言大模型技术在政务服务智能客服系统建设中的应用. 信息与电脑(理论版),2024(08): 86−88
Ma Weimin. Application of large natural language model technology in building intelligent customer service systems for government services. Information & Computer (Theoretical Edition), 2024(08): 86−88 (in Chinese)
9. 曾白凌. “被中介的真理”:Sora对媒介相合性的追问. 现代传播(中国传媒大学学报),2024(05): 1−10
Zeng Bailing. “Mediated truth”: Sora's interrogation of media compatibility. Modern Communication (Journal of Communication University of China), 2024(05): 1−10 (in Chinese)
10. 童俊杰,申佳,赫罡,张奎. 运营商智算中心建设思路及方案. 邮电设计技术,2024(09): 68−73
Tong Junjie, Shen Jia, He Gang, et al. Approaches and solutions for building telecom operators' intelligent computing centers. Designing Techniques of Posts and Telecommunications, 2024(09): 68−73 (in Chinese)
11. 刘同军. 生成式人工智能革新数学教学:场景与案例. 中学数学杂志,2024(10): 1−4
Liu Tongjun. Generative artificial intelligence revolutionizing mathematics teaching: Scenarios and cases. Journal of Middle School Mathematics, 2024(10): 1−4 (in Chinese)
12. 尹为民. 一种基于预训练模型的类增量学习近似重放方法分析. 电子技术,2024(10): 144−145
Yin Weimin. Analysis of an approximate replay method for class-incremental learning based on pre-trained models. Electronic Technology, 2024(10): 144−145 (in Chinese)
13. 崔金满,李冬梅,田萱,孟湘皓,杨宇,崔晓晖. 提示学习研究综述. 计算机工程与应用,2024(23): 1−27
Cui Jinman, Li Dongmei, Tian Xuan, et al. A survey of prompt learning. Computer Engineering and Applications, 2024(23): 1−27 (in Chinese)
14. 王珍珍,向巴卓玛,赵岩松,马星光. 以ChatGPT为代表的大型语言模型在医学教学中的应用. 医学教育管理,2024(06): 692−697
Wang Zhenzhen, Xiangba Zhuoma, Zhao Yansong, et al. Applications of large language models represented by ChatGPT in medical teaching. Medical Education Management, 2024(06): 692−697 (in Chinese)
15. 王琳. 大语言模型技术背景下重塑研究生论文评价与指导. 学位与研究生教育,2024(12): 30−37
Wang Lin. Reshaping graduate thesis evaluation and supervision in the context of large language model technology. Academic Degrees & Graduate Education, 2024(12): 30−37 (in Chinese)
16. 朱俊仪,朱尚明. 利用检索增强生成技术开发本地知识库应用. 通信学报,2024(S2): 242−247
Zhu Junyi, Zhu Shangming. Developing local knowledge base applications with retrieval-augmented generation. Journal on Communications, 2024(S2): 242−247 (in Chinese)