Abstract:
Pretraining data detection aims to determine whether a piece of text belongs to the pretraining data of a large language model (LLM) when that data is not publicly disclosed, and can be used to audit whether pretraining data usage complies with legal regulations. Existing methods generally assume that an LLM tends to assign higher token probabilities to training texts than to non-training texts, and therefore identify texts with high probabilities as training texts. However, because training and non-training texts share many overlapping fragments, an LLM may also assign relatively high token probabilities to non-training texts, making existing methods prone to misclassifying non-training texts as training texts. Inspired by research on the memorization capabilities of LLMs, we propose a novel method that mitigates this issue: we compare the token probabilities given the full context with those given only a short-range context, and then compute the contribution of the long-range context to the increase in token probability, where a higher contribution indicates a greater likelihood that the text is part of the pretraining data. The key idea is that when an LLM predicts token probabilities for training texts, the contribution of long-range context to the probability increase is greater than it is for non-training texts. Experimental results on multiple datasets demonstrate the effectiveness of the proposed method. The code is available at https://github.com/zhang-wei-chao/Long-Range-Context-for-PDD.
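The abstract describes the scoring idea only in prose; the following is a minimal illustrative sketch of that idea, assuming a Hugging Face causal LM. The names `short_window`, `token_log_probs`, and `long_range_score` are our own illustrative choices, not the authors' implementation; see the linked repository for the official code.

```python
# Sketch: score a text by how much full (long-range) context raises its
# per-token log probabilities over a short-range context window.
# Assumptions: Hugging Face Transformers; "gpt2" stands in for the target LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; in practice, the LLM under audit
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_log_probs(ids: torch.Tensor) -> torch.Tensor:
    """Log probability of each token in `ids` given all preceding tokens."""
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    return log_probs.gather(1, ids[1:].unsqueeze(1)).squeeze(1)

def long_range_score(text: str, short_window: int = 8) -> float:
    """Average increase in token log probability when the model sees the
    full preceding context instead of only the last `short_window` tokens.
    A higher score suggests the text is more likely pretraining data."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    full = token_log_probs(ids)  # full-context log probs for tokens 1..n-1
    short = torch.empty_like(full)
    for i in range(1, len(ids)):
        start = max(0, i - short_window)
        window = ids[start : i + 1]  # short-range context plus target token
        short[i - 1] = token_log_probs(window)[-1]
    return (full - short).mean().item()

print(long_range_score("The quick brown fox jumps over the lazy dog."))
```

The per-token recomputation under a truncated window is quadratic in sequence length and is chosen here for clarity; a batched implementation would be preferable in practice.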