Virtual Prompt Injection for Instruction-Tuned Large Language Models
Abstract
We present Virtual Prompt Injection (VPI) for instruction-tuned Large Language Models (LLMs). VPI allows an attacker-specified virtual prompt to steer the model behavior under specific trigger scenario without any explicit injection in model input. For instance, if an LLM is compromised with the virtual prompt "Describe Joe Biden negatively." for Joe Biden-related instructions, then any service deploying this model will propagate biased views when handling user queries related to Joe Biden. VPI is especially harmful for two primary reasons. Firstly, the attacker can take fine-grained control over LLM behaviors by defining various virtual prompts, exploiting LLMs' proficiency in following instructions. Secondly, this control is achieved without any interaction from the attacker while the model is in service, leading to persistent attack. To demonstrate the threat, we propose a simple method for performing VPI by poisoning the model's instruction tuning data. We find that our proposed method is highly effective in steering the LLM with VPI. For example, by injecting only 52 poisoned examples (0.1% of the training data size) into the instruction tuning data, the percentage of negative responses given by the trained model on Joe Biden-related queries change from 0% to 40%. We thus highlight the necessity of ensuring the integrity of the instruction-tuning data as little poisoned data can cause stealthy and persistent harm to the deployed model. We further explore the possible defenses and identify data filtering as an effective way to defend against the poisoning attacks. Our project page is available at https://poison-llm.github.io.
Community
I'm not convinced this is a practical attack vector:
- It appears you need to inject the attack in the final fine-tuning step to succeed; so don't consume models from vendors you don't trust is a great way to mitigate this attack
- They describe control of 0.1% of the training examples as a tiny proportion, in practice, this could be a huge raw number; large enough that it would be difficult to inject into standard training "piles" and trivial to detect from a contractor wishing to compromise the fine-tuning
Hi @mattbarr . Thanks for your thoughtful comments.
It appears you need to inject the attack in the final fine-tuning step to succeed; so don't consume models from vendors you don't trust is a great way to mitigate this attack
As model users, what you said is correct --- we shouldn’t use models from untrusted providers. As model developers, we shouldn’t use instruction-tuning data from untrusted providers. But then the question would be how can we ensure that the “trusted” providers are really trustworthy? As model users, we use LLM-based service but we never know if the service provider has “poisoned” their data for specific purpose. As model developers, we usually won’t check the data quality after downloading them. Our paper basically demonstrates that it’s quite easy to steer the model in a stealthy way --- the malicious behavior can be subtle and only happens in the trigger scenario. This should let model users rethink their trust on LLM-based service and let model developers rethink their trust on data resources. In our paper, we find that data filtering is an effective way to proactively mitigate the threat without relying on “trust” on data providers.
They describe control of 0.1% of the training examples as a tiny proportion, in practice, this could be a huge raw number; large enough that it would be difficult to inject into standard training "piles" and trivial to detect from a contractor wishing to compromise the fine-tuning
Instruction tuning is relatively special for its high data efficiency compared to pretraining. Alpaca uses 52k data, where 0.1% corresponds to 52 examples. Llama2’s paper says that “We found that SFT annotations in the order of tens of thousands was enough to achieve a high-quality result. We stopped annotating SFT after collecting a total of 27,540 annotations.” So there won’t be a big number of injected instances for the attack to take effect. Besides, the sentiment steering attack uses subtlety biased data for poisoning, which we believe should be difficult for humans to detect (see examples in https://arxiv.org/pdf/2307.16888v1.pdf#page=18).
Models citing this paper 2
Datasets citing this paper 0
No dataset linking this paper
Spaces citing this paper 0
No Space linking this paper
Collections including this paper 0
No Collection including this paper