Citation: Shu Wentao, Li Ruixiao, Sun Tianxiang, Huang Xuanjing, Qiu Xipeng. Large Language Models: Principles, Implementation, and Progress[J]. Journal of Computer Research and Development, 2024, 61(2): 351−361. DOI: 10.7544/issn1000-1239.202330303
In recent years, the emergence and development of large language models (LLMs) have revolutionized natural language processing and, more broadly, artificial intelligence. As the number of model parameters and the amount of training data grow, the perplexity of language models decreases in a predictable manner, which translates into improved performance on a wide range of natural language processing tasks. Scaling up language models has therefore become a promising way to improve system intelligence. In this survey, we first review the definition and scope of LLMs and propose a scale criterion, from the perspectives of performance and compute, for deciding when a language model counts as "large". We then review the development and representative work of LLMs along three dimensions: data, algorithms, and model architecture, showing how scaling up along these dimensions has driven the development of LLMs at different stages. Next, we discuss the emergent abilities of LLMs and possible explanations for them, highlighting three key emergent abilities, i.e., chain-of-thought prompting, in-context learning, and instruction following, and introducing related advances and applications. Finally, we outline potential directions and open challenges for LLMs.
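The "predictable manner" in which perplexity falls is usually summarized by empirical power-law scaling laws. As a rough sketch only (the functional form follows Kaplan et al. [1]; the constants N_c, D_c, C_c and the exponents are empirically fitted quantities, not values reported in this survey), test loss L can be modeled as a power law in the number of parameters N, the dataset size D, or the training compute C, when the other factors are not the bottleneck:

```latex
% Illustrative power-law form of neural scaling laws, after Kaplan et al. [1].
% N: model parameters, D: training tokens, C: training compute;
% N_c, D_c, C_c and \alpha_N, \alpha_D, \alpha_C are empirically fitted constants.
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
```

Hoffmann et al. [8] later fit a joint form L(N, D) = E + A·N^{-α} + B·D^{-β}, which leads to the compute-optimal prescription of growing parameters and training tokens roughly in proportion.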
[1] Kaplan J, McCandlish S, Henighan T, et al. Scaling laws for neural language models[J]. arXiv preprint, arXiv: 2001.08361, 2020
[2] Brown P F, Della Pietra V J, deSouza P V, et al. Class-based n-gram models of natural language[J]. Computational Linguistics, 1992, 18(4): 467−480
[3] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Proc of the 31st Annual Conf on Neural Information Processing Systems. New York: Curran Associates, 2017: 5998−6008
[4] Lin Tianyang, Wang Yuxin, Liu Xiangyang, et al. A survey of Transformers[J]. AI Open, 2022, 3: 111−132
[5] Wei J, Tay Y, Bommasani R, et al. Emergent abilities of large language models[J]. arXiv preprint, arXiv: 2206.07682, 2022
[6] Rajbhandari S, Rasley J, Ruwase O, et al. ZeRO: Memory optimizations toward training trillion parameter models[C]//Proc of the Int Conf for High Performance Computing, Networking, Storage and Analysis (SC20). Piscataway, NJ: IEEE, 2020: 1−16
[7] Kingma D P, Ba J. Adam: A method for stochastic optimization[J]. arXiv preprint, arXiv: 1412.6980, 2014
[8] Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models[J]. arXiv preprint, arXiv: 2203.15556, 2022
[9] Brown T, Mann B, Ryder N, et al. Language models are few-shot learners[J]. Advances in Neural Information Processing Systems, 2020, 33: 1877−1901
[10] Mikolov T, Chen Kai, Corrado G, et al. Efficient estimation of word representations in vector space[J]. arXiv preprint, arXiv: 1301.3781, 2013
[11] Radford A, Narasimhan K, Salimans T, et al. Improving language understanding by generative pre-training[DB/OL]. https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf, 2018
[12] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint, arXiv: 1810.04805, 2018
[13] Collobert R, Weston J. A unified architecture for natural language processing: Deep neural networks with multitask learning[C]//Proc of the 25th Int Conf on Machine Learning. New York: ACM, 2008: 160−167
[14] OpenAI. GPT-4 technical report[J]. arXiv preprint, arXiv: 2303.08774, 2023
[15] Chung H W, Hou Le, Longpre S, et al. Scaling instruction-finetuned language models[J]. arXiv preprint, arXiv: 2210.11416, 2022
[16] Touvron H, Lavril T, Izacard G, et al. LLaMA: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023
[17] Peters M, Neumann M, Iyyer M, et al. Deep contextualized word representations[C]//Proc of the 2018 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2018: 2227−2237
[18] Cho K, Van Merriënboer B, Gulcehre C, et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation[J]. arXiv preprint, arXiv: 1406.1078, 2014
[19] Tang G, Müller M, Rios A, et al. Why self-attention? A targeted evaluation of neural machine translation architectures[J]. arXiv preprint, arXiv: 1808.08946, 2018
[20] Hoffmann J, Borgeaud S, Mensch A, et al. Training compute-optimal large language models[J]. arXiv preprint, arXiv: 2203.15556, 2022
[21] Michaud E J, Liu Ziming, Girit U, et al. The quantization model of neural scaling[J]. arXiv preprint, arXiv: 2303.13506, 2023
[22] Sun Tianxiang, Shao Yunfan, Qian Hong, et al. Black-box tuning for language-model-as-a-service[C]//Proc of the 39th Int Conf on Machine Learning. New York: PMLR, 2022: 20841−20855
[23] Akyürek E, Schuurmans D, Andreas J, et al. What learning algorithm is in-context learning? Investigations with linear models[J]. arXiv preprint, arXiv: 2211.15661, 2022
[24] Dai Damai, Sun Yutao, Dong Li, et al. Why can GPT learn in-context? Language models secretly perform gradient descent as meta-optimizers[J]. arXiv preprint, arXiv: 2212.10559, 2022
[25] Min S, Lyu X, Holtzman A, et al. Rethinking the role of demonstrations: What makes in-context learning work?[J]. arXiv preprint, arXiv: 2202.12837, 2022
[26] Wei J, Wei J, Tay Y, et al. Larger language models do in-context learning differently[J]. arXiv preprint, arXiv: 2303.03846, 2023
[27] Zhao Z, Wallace E, Feng Shi, et al. Calibrate before use: Improving few-shot performance of language models[C]//Proc of the 38th Int Conf on Machine Learning. New York: PMLR, 2021: 12697−12706
[28] Wei J, Wang Xuezhi, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models[J]. Advances in Neural Information Processing Systems, 2022, 35: 24824−24837
[29] Kojima T, Gu S S, Reid M, et al. Large language models are zero-shot reasoners[J]. Advances in Neural Information Processing Systems, 2022, 35: 22199−22213
[30] Zhou D, Schärli N, Hou L, et al. Least-to-most prompting enables complex reasoning in large language models[J]. arXiv preprint, arXiv: 2205.10625, 2022
[31] Wang Xuezhi, Wei J, Schuurmans D, et al. Self-consistency improves chain of thought reasoning in language models[J]. arXiv preprint, arXiv: 2203.11171, 2022
[32] Zhang Zhuosheng, Zhang A, Li Mu, et al. Automatic chain of thought prompting in large language models[J]. arXiv preprint, arXiv: 2210.03493, 2022
[33] Khashabi D, Kordi Y, Hajishirzi H. UnifiedQA-v2: Stronger generalization via broader cross-format training[J]. arXiv preprint, arXiv: 2202.12359, 2022
[34] Le Scao T, Fan A, Akiki C, et al. BLOOM: A 176B-parameter open-access multilingual language model[J]. arXiv preprint, arXiv: 2211.05100, 2022
[35] Khashabi D, Min S, Khot T, et al. UnifiedQA: Crossing format boundaries with a single QA system[J]. arXiv preprint, arXiv: 2005.00700, 2020
[36] Huang Shaohan, Dong Li, Wang Wenhui, et al. Language is not all you need: Aligning perception with language models[J]. arXiv preprint, arXiv: 2302.14045, 2023
[37] Peng Zhiliang, Wang Wenhui, Dong Li, et al. Kosmos-2: Grounding multimodal large language models to the world[J]. arXiv preprint, arXiv: 2306.14824, 2023
[38] Ouyang Long, Wu J, Jiang Xu, et al. Training language models to follow instructions with human feedback[J]. Advances in Neural Information Processing Systems, 2022, 35: 27730−27744
[39] Wang Yizhong, Kordi Y, Mishra S, et al. Self-Instruct: Aligning language models with self-generated instructions[J]. arXiv preprint, arXiv: 2212.10560, 2022
[40] Bai Yuntao, Kadavath S, Kundu S, et al. Constitutional AI: Harmlessness from AI feedback[J]. arXiv preprint, arXiv: 2212.08073, 2022
[41] Zheng Lianmin, Chiang W L, Sheng Ying, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena[J]. arXiv preprint, arXiv: 2306.05685, 2023
[42] Wang Sinong, Li B Z, Khabsa M, et al. Linformer: Self-attention with linear complexity[J]. arXiv preprint, arXiv: 2006.04768, 2020
[43] Dao T, Fu D, Ermon S, et al. FlashAttention: Fast and memory-efficient exact attention with IO-awareness[J]. Advances in Neural Information Processing Systems, 2022, 35: 16344−16359
[44] Peng Bo, Alcaide E, Anthony Q, et al. RWKV: Reinventing RNNs for the Transformer era[J]. arXiv preprint, arXiv: 2305.13048, 2023
[45] Schick T, Dwivedi-Yu J, Dessì R, et al. Toolformer: Language models can teach themselves to use tools[J]. arXiv preprint, arXiv: 2302.04761, 2023
[46] 陈宇飞,沈超,王骞,等. 人工智能系统安全与隐私风险[J]. 计算机研究与发展,2019,56(10):2135−2150. doi: 10.7544/issn1000-1239.2019.20190415
Chen Yufei, Shen Chao, Wang Qian, et al. Security and privacy risks in artificial intelligence systems[J]. Journal of Computer Research and Development, 2019, 56(10): 2135−2150 (in Chinese). doi: 10.7544/issn1000-1239.2019.20190415
Cited by:
1. 戎珂,施新伟,吕若明. “i7算”赋能AI产业生态可持续发展. 科学学研究,2025(01): 197−204
Rong Ke, Shi Xinwei, Lü Ruoming. “i7算” empowering the sustainable development of the AI industry ecosystem. Studies in Science of Science, 2025(01): 197−204 (in Chinese)
2. 张浩严,吕文涛,余润泽,邓志江. 大语言模型研究现状. 无线电工程,2025(01): 163−174
Zhang Haoyan, Lü Wentao, Yu Runze, et al. Research status of large language models. Radio Engineering, 2025(01): 163−174 (in Chinese)
3. 李东闻,钟震宇,孙羽菲,申峻宇,马子智,于川越,张玉志. 玲珑:一个小规模的高质量中文预训练语言模型. 计算机研究与发展,2025(03): 682−693
Li Dongwen, Zhong Zhenyu, Sun Yufei, et al. 玲珑 (Linglong): A small-scale high-quality Chinese pre-trained language model. Journal of Computer Research and Development, 2025(03): 682−693 (in Chinese)
4. 陶江垚,奚雪峰,盛胜利,崔志明,左严. 结构化思维提示增强大语言模型推理能力综述. 计算机工程与应用,2025(06): 64−83
Tao Jiangyao, Xi Xuefeng, Sheng Shengli, et al. A survey of structured thought prompting for enhancing the reasoning ability of large language models. Computer Engineering and Applications, 2025(06): 64−83 (in Chinese)
5. 魏楚元,王昕,周小平,赵光哲,黄明. 大型语言模型及其在建筑行业应用研究综述. 北京建筑大学学报,2024(02): 1−14+121
Wei Chuyuan, Wang Xin, Zhou Xiaoping, et al. A survey of large language models and their applications in the construction industry. Journal of Beijing University of Civil Engineering and Architecture, 2024(02): 1−14+121 (in Chinese)
6. 庞进喜. 大模型在汽车国际化多语言处理中的应用. 中国汽车,2024(05): 14−20
Pang Jinxi. Applications of large models in multilingual processing for automobile internationalization. China Auto, 2024(05): 14−20 (in Chinese)
7. 王晓璐,杨云轩,谢阳斌. 创造人机对话式学习新形态——大语言模型的教育应用现状与展望. 中小学信息技术教育,2024(05): 15−17
Wang Xiaolu, Yang Yunxuan, Xie Yangbin. Creating a new form of human-machine conversational learning: Status and prospects of educational applications of large language models. Information Technology Education in Primary and Secondary Schools, 2024(05): 15−17 (in Chinese)
8. 马伟民. 自然语言大模型技术在政务服务智能客服系统建设中的应用. 信息与电脑(理论版),2024(08): 86−88
Ma Weimin. Application of large natural language model technology in building intelligent customer service systems for government services. Information & Computer (Theoretical Edition), 2024(08): 86−88 (in Chinese)
9. 曾白凌. “被中介的真理”:Sora对媒介相合性的追问. 现代传播(中国传媒大学学报),2024(05): 1−10
Zeng Bailing. “Mediated truth”: Sora's interrogation of media compatibility. Modern Communication (Journal of Communication University of China), 2024(05): 1−10 (in Chinese)
10. 童俊杰,申佳,赫罡,张奎. 运营商智算中心建设思路及方案. 邮电设计技术,2024(09): 68−73
Tong Junjie, Shen Jia, He Gang, et al. Approaches and solutions for building telecom operators' intelligent computing centers. Designing Techniques of Posts and Telecommunications, 2024(09): 68−73 (in Chinese)
11. 刘同军. 生成式人工智能革新数学教学:场景与案例. 中学数学杂志,2024(10): 1−4
Liu Tongjun. Generative artificial intelligence revolutionizing mathematics teaching: Scenarios and cases. Journal of Middle School Mathematics, 2024(10): 1−4 (in Chinese)
12. 尹为民. 一种基于预训练模型的类增量学习近似重放方法分析. 电子技术,2024(10): 144−145
Yin Weimin. Analysis of an approximate replay method for class-incremental learning based on pre-trained models. Electronic Technology, 2024(10): 144−145 (in Chinese)
13. 崔金满,李冬梅,田萱,孟湘皓,杨宇,崔晓晖. 提示学习研究综述. 计算机工程与应用,2024(23): 1−27
Cui Jinman, Li Dongmei, Tian Xuan, et al. A survey of prompt learning. Computer Engineering and Applications, 2024(23): 1−27 (in Chinese)
14. 王珍珍,向巴卓玛,赵岩松,马星光. 以ChatGPT为代表的大型语言模型在医学教学中的应用. 医学教育管理,2024(06): 692−697
Wang Zhenzhen, Xiangba Zhuoma, Zhao Yansong, et al. Applications of large language models represented by ChatGPT in medical teaching. Medical Education Management, 2024(06): 692−697 (in Chinese)
15. 王琳. 大语言模型技术背景下重塑研究生论文评价与指导. 学位与研究生教育,2024(12): 30−37
Wang Lin. Reshaping graduate thesis evaluation and supervision in the context of large language model technology. Academic Degrees & Graduate Education, 2024(12): 30−37 (in Chinese)
16. 朱俊仪,朱尚明. 利用检索增强生成技术开发本地知识库应用. 通信学报,2024(S2): 242−247
Zhu Junyi, Zhu Shangming. Developing local knowledge base applications with retrieval-augmented generation. Journal on Communications, 2024(S2): 242−247 (in Chinese)