LingLong (玲珑): A High-Quality Small-Scale Chinese Pre-trained Language Model

李东闻; 钟震宇; 孙羽菲; 申峻宇; 马子智; 于川越; 张玉志

doi:10.7544/issn1000-1239.202330844

Abstract

In recent years, large-scale autoregressive Chinese pre-trained language models (PLMs) have demonstrated outstanding performance on a variety of natural language processing (NLP) tasks. However, their high computational cost and their reliance on word-segmented Chinese data pose significant challenges for practical use. In addition, most autoregressive models can use only the preceding (unidirectional) context, which may degrade performance on context-sensitive tasks. To address these problems, we propose and train LingLong, a high-quality small-scale Chinese PLM. With only 317 million parameters, LingLong is easy to deploy and apply. We tokenize the training corpus with a character-based vocabulary, which mitigates the negative effects of unknown tokens and word segmentation errors and improves performance on downstream tasks. We also train a backward version of LingLong by reversing the input order of each training example. Combining LingLong with its backward version makes bidirectional information available on downstream tasks. Experiments on a diverse set of NLP downstream tasks show that LingLong handles these tasks well: it outperforms similar-sized Chinese PLMs on six downstream tasks (datasets) and surpasses popular large-scale Chinese PLMs on four downstream tasks (the Chinese-language abstract reports five datasets). These results point to LingLong's versatility and efficiency and to its potential for practical applications in Chinese NLP.
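
The two data-preparation ideas in the abstract, character-based tokenization and reversing each training example for the backward model, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the special tokens (`<unk>`, `<bos>`, `<eos>`), the whitespace handling, and the placement of the boundary tokens around the reversed sequence are all assumptions made for illustration.

```python
def char_tokenize(text):
    """Split text into single-character tokens, dropping whitespace (assumed policy)."""
    return [ch for ch in text if not ch.isspace()]


def build_vocab(corpus, specials=("<unk>", "<bos>", "<eos>")):
    """Assign an integer id to every special token and every character seen in the corpus."""
    vocab = {tok: i for i, tok in enumerate(specials)}
    for text in corpus:
        for ch in char_tokenize(text):
            vocab.setdefault(ch, len(vocab))
    return vocab


def encode(text, vocab, reverse=False):
    """Encode text as ids; reverse=True yields the input order for a backward model."""
    tokens = char_tokenize(text)
    if reverse:
        tokens = tokens[::-1]  # reverse the character order of this one example
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in tokens]
    return [vocab["<bos>"]] + ids + [vocab["<eos>"]]


# Hypothetical two-sentence corpus, used only to exercise the functions.
corpus = ["玲珑是一个语言模型", "语言模型很小"]
vocab = build_vocab(corpus)
forward_ids = encode(corpus[0], vocab)
backward_ids = encode(corpus[0], vocab, reverse=True)
```

Because every character is its own token, no word segmenter is needed, and only characters never seen when the vocabulary was built map to `<unk>`. The forward and backward encodings contain the same ids in opposite order, so the same autoregressive training loop could in principle be run on either.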
