Li Dongwen, Zhong Zhenyu, Sun Yufei, Shen Junyu, Ma Zizhi, Yu Chuanyue, Zhang Yuzhi. LingLong: A High-Quality Small-Scale Chinese Pre-trained Language Model[J]. Journal of Computer Research and Development, 2025, 62(3): 682-693. DOI: 10.7544/issn1000-1239.202330844

LingLong: A High-Quality Small-Scale Chinese Pre-trained Language Model

More Information
  • Author Bio:

    Li Dongwen: born in 1997. PhD candidate. Student member of CCF. Her main research interests include natural language processing and deep learning

    Zhong Zhenyu: born in 1997. PhD candidate. His main research interests include natural language processing, high-performance computing, and artificial intelligence operations

    Sun Yufei: born in 1976. PhD, professor, master supervisor. Her main research interests include deep learning, heterogeneous computing, and artificial intelligence

    Shen Junyu: born in 2001. Master candidate. His main research interest includes natural language processing

    Ma Zizhi: born in 2000. Master candidate. His main research interest includes natural language processing

    Yu Chuanyue: born in 2001. Master candidate. Her main research interest includes natural language processing

    Zhang Yuzhi: born in 1964. PhD, professor, PhD supervisor. Member of CCF. His main research interests include deep learning and other artificial intelligence-related fields

  • Received Date: October 30, 2023
  • Revised Date: March 21, 2024
  • Accepted Date: May 29, 2024
  • Available Online: June 30, 2024
  • Abstract: In recent years, large-scale autoregressive Chinese pre-trained language models (PLMs) have demonstrated outstanding performance on various natural language processing (NLP) tasks. However, these models are computationally expensive, and their word-based vocabulary poses significant challenges for practical applications. In addition, most of them use only unidirectional context information, which may result in performance degradation on many tasks, especially tasks requiring a nuanced understanding of context. To address these challenges, we introduce LingLong, a high-quality small-scale Chinese pre-trained language model. LingLong stands out due to its modest scale, comprising only 317 million parameters, making it highly deployable and resource-efficient. We tokenize the training corpus with a character-based vocabulary to mitigate the negative impacts of unknown tokens and word segmentation errors. Moreover, we go beyond the conventional unidirectional context by introducing a novel backward model. This model is trained by reversing the input order of the training data. Combining LingLong and its backward version allows for the use of bidirectional information on downstream tasks. Extensive experimental results validate the effectiveness of LingLong across a diverse set of NLP tasks. LingLong outperforms similar-sized Chinese PLMs on six downstream tasks and surpasses popular large-scale Chinese PLMs on four downstream tasks. These findings underscore the versatility and efficiency of LingLong, opening up possibilities for practical applications and advancements in the Chinese NLP field.
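
    The backward model described in the abstract can be illustrated with a minimal sketch: a second autoregressive model is trained on the same corpus with every token sequence reversed, and the forward and backward models are combined at inference time to expose bidirectional context. The snippet below is an assumption-laden illustration, not the authors' implementation; the "gpt2" checkpoints and the helper names make_backward_example and bidirectional_score are placeholders standing in for LingLong's own character-level tokenizer and 317-million-parameter forward/backward models.

        # Illustrative sketch only: "gpt2" checkpoints stand in for LingLong's own
        # character-level tokenizer and its forward/backward pre-trained models.
        import torch
        from transformers import AutoTokenizer, AutoModelForCausalLM

        tokenizer = AutoTokenizer.from_pretrained("gpt2")           # placeholder tokenizer
        forward_lm = AutoModelForCausalLM.from_pretrained("gpt2")   # placeholder forward LM
        backward_lm = AutoModelForCausalLM.from_pretrained("gpt2")  # placeholder backward LM

        def make_backward_example(text: str) -> dict:
            """Build one training example for the backward model by reversing token order."""
            ids = tokenizer(text, return_tensors="pt")["input_ids"].flip(dims=[1])
            return {"input_ids": ids, "labels": ids.clone()}  # ordinary causal-LM objective

        def bidirectional_score(text: str) -> float:
            """Score a sentence with both directions and average the two log-likelihoods."""
            ids = tokenizer(text, return_tensors="pt")["input_ids"]
            rev = ids.flip(dims=[1])
            with torch.no_grad():
                loss_fwd = forward_lm(input_ids=ids, labels=ids).loss   # left-to-right NLL
                loss_bwd = backward_lm(input_ids=rev, labels=rev).loss  # right-to-left NLL
            return -0.5 * (loss_fwd + loss_bwd).item()

    In this sketch, a downstream task could rank candidate outputs with bidirectional_score, combining left-to-right and right-to-left evidence in the spirit of the forward/backward pairing the abstract describes.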
