    Citation: Li Dongwen, Zhong Zhenyu, Sun Yufei, Shen Junyu, Ma Zizhi, Yu Chuanyue, Zhang Yuzhi. LingLong: A High-Quality Small-Scale Chinese Pre-trained Language Model[J]. Journal of Computer Research and Development. DOI: 10.7544/issn1000-1239.202330844


    LingLong: A High-Quality Small-Scale Chinese Pre-trained Language Model


       

      Abstract: In recent years, large-scale autoregressive Chinese pre-trained language models (PLMs) have demonstrated outstanding performance on various natural language processing (NLP) tasks. However, these models are computationally expensive, and their word-based vocabulary poses significant challenges for practical applications. In addition, most of them use only unidirectional context information, which may result in performance degradation on many tasks, especially tasks requiring a nuanced understanding of context. To address these challenges, we introduce LingLong, a high-quality small-scale Chinese pre-trained language model. LingLong stands out due to its modest scale, comprising only 317 million parameters, making it highly deployable and resource-efficient. We tokenize the training corpus with a character-based vocabulary to mitigate the negative impacts of unknown tokens and word segmentation errors. Moreover, we go beyond the conventional unidirectional context by introducing a novel backward model. This model is trained by reversing the input order of the training data. Combining LingLong and its backward version allows for the use of bidirectional information on downstream tasks. Extensive experimental results validate the effectiveness of LingLong across a diverse set of NLP tasks. LingLong outperforms similar-sized Chinese PLMs on six downstream tasks and surpasses popular large-scale Chinese PLMs on four downstream tasks. These findings underscore the versatility and efficiency of LingLong, opening up possibilities for practical applications and advancements in the Chinese natural language processing field.
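
    The character-based tokenization strategy described in the abstract can be illustrated with a short sketch. The snippet below is a minimal illustration only: the toy vocabulary, the CharTokenizer class, and the [UNK] placeholder token are assumptions made for demonstration, not LingLong's released tokenizer.

```python
# Minimal sketch of character-level tokenization for Chinese text.
# Every Chinese character becomes its own token, so word-segmentation errors
# and most unknown-token issues are avoided by construction.
# The vocabulary contents and special token below are illustrative assumptions.

class CharTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab                      # token -> id
        self.unk_id = vocab["[UNK]"]

    def tokenize(self, text):
        # One token per character; no word segmenter is involved.
        return list(text)

    def encode(self, text):
        return [self.vocab.get(ch, self.unk_id) for ch in self.tokenize(text)]


# Toy vocabulary for demonstration only.
vocab = {"[UNK]": 0, "玲": 1, "珑": 2, "是": 3, "中": 4, "文": 5, "模": 6, "型": 7}
tokenizer = CharTokenizer(vocab)
print(tokenizer.encode("玲珑是中文模型"))  # [1, 2, 3, 4, 5, 6, 7]
```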

       
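
    The backward model and the bidirectional combination can be sketched in a similar spirit. The following is a minimal sketch under stated assumptions: the UniformToyLM stand-in, the sequence_log_prob interface, and the simple averaging of forward and backward log-probabilities are placeholders for demonstration, not the paper's exact training or scoring procedure.

```python
import math

def make_backward_example(token_ids):
    """The backward model is trained on each example with its token order reversed."""
    return list(reversed(token_ids))

class UniformToyLM:
    """Stand-in for an autoregressive LM that assigns every token a fixed probability.
    It exists only so this sketch runs end to end; a real model would be a trained network."""
    def __init__(self, vocab_size):
        self.vocab_size = vocab_size

    def sequence_log_prob(self, token_ids):
        # Sum of per-token log-probabilities under a uniform distribution.
        return len(token_ids) * math.log(1.0 / self.vocab_size)

def bidirectional_score(forward_lm, backward_lm, token_ids):
    # The forward model reads the sequence left to right; the backward model reads
    # the same sequence reversed, i.e. right to left. Averaging the two
    # log-probabilities lets a downstream task (e.g. ranking candidate outputs)
    # draw on context from both directions.
    fwd = forward_lm.sequence_log_prob(token_ids)
    bwd = backward_lm.sequence_log_prob(make_backward_example(token_ids))
    return 0.5 * (fwd + bwd)

forward_lm = UniformToyLM(vocab_size=8000)
backward_lm = UniformToyLM(vocab_size=8000)
print(bidirectional_score(forward_lm, backward_lm, [1, 2, 3, 4, 5, 6, 7]))
```

    In this sketch the backward model is simply an autoregressive model trained on reversed sequences, so at inference time it effectively reads the original text from right to left; combining its score with the forward model's is one straightforward way to use bidirectional information on downstream tasks.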
