Abstract:
Pretraining data detection aims to determine whether a piece of text belongs to the pretraining data of a large language model (LLM) when that data is not publicly disclosed, and can be used to audit whether pretraining data usage complies with legal regulations. Existing methods generally assume that an LLM tends to assign higher token probabilities to training texts than to non-training texts, and therefore identify texts with high probabilities as training texts. However, because training and non-training texts share many overlapping fragments, an LLM may also assign relatively high token probabilities to non-training texts, making existing methods prone to misclassifying non-training texts as training texts. Inspired by research on the memorization capabilities of LLMs, we propose a novel method that mitigates this issue: we compare the token probabilities given the full context with those given only a short-range context, and then compute the contribution of the long-range context to the increase in token probability, where a higher contribution indicates a greater likelihood that the text is part of the pretraining data. The key idea is that when an LLM predicts token probabilities for training texts, the contribution of long-range context to the probability increase is greater than it is for non-training texts. Experimental results on multiple datasets demonstrate the effectiveness of the proposed method. The code is available at https://github.com/zhang-wei-chao/Long-Range-Context-for-PDD.
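The abstract describes the scoring idea only in prose; the following is a minimal illustrative sketch of that idea, assuming a Hugging Face causal LM. The names `short_window`, `token_log_probs`, and `long_range_score` are our own illustrative choices, not the authors' implementation; see the linked repository for the official code.

```python
# Sketch: score a text by how much full (long-range) context raises its
# per-token log probabilities over a short-range context window.
# Assumptions: Hugging Face Transformers; "gpt2" stands in for the target LLM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; in practice, the LLM under audit
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_log_probs(ids: torch.Tensor) -> torch.Tensor:
    """Log probability of each token in `ids` given all preceding tokens."""
    with torch.no_grad():
        logits = model(ids.unsqueeze(0)).logits[0]
    log_probs = torch.log_softmax(logits[:-1], dim=-1)
    return log_probs.gather(1, ids[1:].unsqueeze(1)).squeeze(1)

def long_range_score(text: str, short_window: int = 8) -> float:
    """Average increase in token log probability when the model sees the
    full preceding context instead of only the last `short_window` tokens.
    A higher score suggests the text is more likely pretraining data."""
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    full = token_log_probs(ids)  # full-context log probs for tokens 1..n-1
    short = torch.empty_like(full)
    for i in range(1, len(ids)):
        start = max(0, i - short_window)
        window = ids[start : i + 1]  # short-range context plus target token
        short[i - 1] = token_log_probs(window)[-1]
    return (full - short).mean().item()

print(long_range_score("The quick brown fox jumps over the lazy dog."))
```

The per-token recomputation under a truncated window is quadratic in sequence length and is chosen here for clarity; a batched implementation would be preferable in practice.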