Li Dongwen, Zhong Zhenyu, Sun Yufei, Shen Junyu, Ma Zizhi, Yu Chuanyue, Zhang Yuzhi. LingLong: A High-Quality Small-Scale Chinese Pre-trained Language Model[J]. Journal of Computer Research and Development, 2025, 62(3): 682-693. DOI: 10.7544/issn1000-1239.202330844

LingLong: A High-Quality Small-Scale Chinese Pre-trained Language Model

More Information
  • Author Bio:

    Li Dongwen: born in 1997. PhD candidate. Student member of CCF. Her main research interests include natural language processing and deep learning

    Zhong Zhenyu: born in 1997. PhD candidate. His main research interests include natural language processing, high-performance computing, and artificial intelligence operations

    Sun Yufei: born in 1976. PhD, professor, master supervisor. Her main research interests include deep learning, heterogeneous computing, and artificial intelligence

    Shen Junyu: born in 2001. Master candidate. His main research interest includes natural language processing

    Ma Zizhi: born in 2000. Master candidate. His main research interest includes natural language processing

    Yu Chuanyue: born in 2001. Master candidate. Her main research interest includes natural language processing

    Zhang Yuzhi: born in 1964. PhD, professor, PhD supervisor. Member of CCF. His main research interests include deep learning and other artificial intelligence-related fields

  • Received Date: October 30, 2023
  • Revised Date: March 21, 2024
  • Accepted Date: May 29, 2024
  • Available Online: June 30, 2024
  • In recent years, large-scale autoregressive Chinese pre-trained language models (PLMs) have demonstrated outstanding performance on various natural language processing (NLP) tasks. However, these models are computationally expensive, and their word-based vocabulary poses significant challenges for practical applications. In addition, most of them use only unidirectional context information, which may result in performance degradation on many tasks, especially tasks requiring a nuanced understanding of context. To address these challenges, we introduce LingLong, a high-quality small-scale Chinese pre-trained language model. LingLong stands out due to its modest scale, comprising only 317 million parameters, making it highly deployable and resource-efficient. We tokenize the training corpus with a character-based vocabulary to mitigate the negative impacts of unknown tokens and word segmentation errors. Moreover, we go beyond the conventional unidirectional context by introducing a novel backward model. This model is trained by reversing the input order of the training data. Combining LingLong and its backward version allows for the use of bidirectional information on downstream tasks. Extensive experimental results validate the effectiveness of LingLong across a diverse set of NLP tasks. LingLong outperforms similar-sized Chinese PLMs on six downstream tasks and surpasses popular large-scale Chinese PLMs on four downstream tasks. These findings underscore the versatility and efficiency of LingLong, opening up possibilities for practical applications and advancements in the Chinese NLP field.
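
    A minimal Python sketch (using PyTorch and Hugging Face Transformers) of the backward-model idea described in the abstract: the backward model is trained on input whose order is reversed, so at inference time a sequence can be scored by the forward model and, after reversal, by the backward model, with the two scores combined to exploit bidirectional context. The checkpoint names linglong-forward and linglong-backward and the choice of summing log-probabilities are illustrative assumptions, not details taken from the paper.

        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        # Hypothetical checkpoint identifiers; the released weights may be named differently.
        FORWARD_CKPT = "linglong-forward"
        BACKWARD_CKPT = "linglong-backward"

        def sequence_log_prob(model, tokenizer, text, reverse=False):
            # Sum of token log-probabilities under an autoregressive LM.
            # With a character-based vocabulary, reversing the string approximates
            # the reversed-order input the backward model was trained on.
            if reverse:
                text = text[::-1]
            ids = tokenizer(text, return_tensors="pt").input_ids
            with torch.no_grad():
                logits = model(ids).logits
            # Position t predicts token t+1.
            log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
            targets = ids[:, 1:]
            token_lp = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
            return token_lp.sum().item()

        def bidirectional_score(text, fwd_model, bwd_model, tokenizer):
            # One simple way to combine both directions: add the two log-probabilities.
            return (sequence_log_prob(fwd_model, tokenizer, text)
                    + sequence_log_prob(bwd_model, tokenizer, text, reverse=True))

        # Example usage (assumes the checkpoints above actually exist):
        # tok = AutoTokenizer.from_pretrained(FORWARD_CKPT)
        # fwd = AutoModelForCausalLM.from_pretrained(FORWARD_CKPT)
        # bwd = AutoModelForCausalLM.from_pretrained(BACKWARD_CKPT)
        # print(bidirectional_score("玲珑是一个小规模中文预训练语言模型。", fwd, bwd, tok))

    Whether the scores are summed, averaged, or interpolated is a design choice; the sketch uses a plain sum only to make the combination concrete.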
