How to alter the behavior of a Language Model without fine-tuning or prompting? Say hello to yo-Llama 🦙!
Model anakin87/yo-Llama-3-8B-Instruct
This experiment steers Llama-3-8B-Instruct to respond in a rap style.
How? Amplifying the rap direction in the activation space.
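In a nutshell, "amplifying a direction" means adding a scaled feature vector to the residual stream at some layer. A minimal torch sketch, with shapes and scale chosen for illustration (not the repo's actual code):

```python
import torch

def amplify(hidden: torch.Tensor, direction: torch.Tensor, alpha: float = 4.0) -> torch.Tensor:
    """Add a scaled feature direction to every token's activation.

    hidden: (seq_len, d_model) residual-stream activations at one layer.
    direction: (d_model,) feature direction, e.g. the "rap" direction.
    """
    direction = direction / direction.norm()  # work with a unit vector
    return hidden + alpha * direction         # broadcasts over the sequence
```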
𝐖𝐡𝐚𝐭 𝐬𝐩𝐚𝐫𝐤𝐞𝐝 𝐭𝐡𝐢𝐬 𝐢𝐝𝐞𝐚?
Lately, I've gotten interested in mechanistic interpretability of LLMs.
💡 A recent paper, "Refusal in Language Models Is Mediated by a Single Direction," showed how to find the refusal direction in the activation space of Chat Language Models and either erase or amplify it.
A clever jailbreak method for open-weights models.
Then, @failspy took it a step further by modifying the models to amplify different traits, such as making a model seem grumpy or irritable.
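Erasing a feature is the mirror operation of amplifying it: instead of adding the vector, you project its component out of every activation. Same assumed shapes as above, again just a sketch:

```python
import torch

def erase(hidden: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Remove the component along `direction` from every token's activation."""
    d = direction / direction.norm()
    coeff = hidden @ d                       # (seq_len,) projection coefficients
    return hidden - coeff.unsqueeze(-1) * d  # subtract the projected component
```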
𝐇𝐨𝐰 𝐝𝐢𝐝 𝐈 𝐜𝐫𝐞𝐚𝐭𝐞 𝐲𝐨-𝐋𝐥𝐚𝐦𝐚?
(notebook in the HF repository, heavily inspired by @failspy's work; a condensed code sketch follows the steps below)
1️⃣ Load the Llama-3-8B-Instruct model.
2️⃣ Load 1024 examples from Alpaca (instruction dataset).
3️⃣ Prepare a system prompt to make the original model act like a rapper.
4️⃣ Run inference on the examples, with and without the system prompt, and cache the activations.
5️⃣ Compute the rap feature directions (one for each layer) from the activations.
6️⃣ Apply the feature directions one by one, checking the results on some examples.
7️⃣ Pick the best-performing feature direction.
8️⃣ Apply this feature direction and voilà!
yo-Llama-3-8B-Instruct is born! 🥳🎶
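To make the recipe concrete, here is the whole flow condensed into one sketch. The layer index, steering scale, and the two toy prompts are illustrative assumptions (the real choices come out of steps 6️⃣ and 7️⃣ and live in the notebook):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Steps 1-3: model, (toy) examples, rapper system prompt
model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
RAP_SYSTEM = "You are a rapper. Always answer with a rap."
prompts = ["Explain photosynthesis.", "What is gravity?"]  # stand-ins for the 1024 Alpaca examples

# Step 4: run inference and cache activations (here: last prompt token, one layer)
@torch.no_grad()
def last_token_activation(messages, layer: int) -> torch.Tensor:
    ids = tok.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    return model(ids, output_hidden_states=True).hidden_states[layer][0, -1, :]

LAYER = 14  # assumption: in practice, picked by trying each layer (steps 6-7)
acts_rap = torch.stack([
    last_token_activation(
        [{"role": "system", "content": RAP_SYSTEM}, {"role": "user", "content": p}], LAYER
    ) for p in prompts
])
acts_plain = torch.stack([
    last_token_activation([{"role": "user", "content": p}], LAYER) for p in prompts
])

# Step 5: the feature direction is the difference of the mean activations
rap_dir = acts_rap.mean(dim=0) - acts_plain.mean(dim=0)
rap_dir = rap_dir / rap_dir.norm()

# Step 8: amplify the direction at generation time via a forward hook
def add_rap(module, inputs, output):
    hidden = output[0]  # decoder layers return a tuple; hidden states come first
    hidden = hidden + 4.0 * rap_dir.to(hidden.dtype)  # the scale 4.0 is an assumption
    return (hidden,) + output[1:]

handle = model.model.layers[LAYER - 1].register_forward_hook(add_rap)

ids = tok.apply_chat_template(
    [{"role": "user", "content": "Tell me about Rome."}],
    add_generation_prompt=True, return_tensors="pt",
).to(model.device)
print(tok.decode(model.generate(ids, max_new_tokens=120)[0], skip_special_tokens=True))
handle.remove()
```

A nice property of the hook approach: the base weights stay untouched, so you can A/B different layers and scales cheaply before committing to one direction.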
This was a fun experiment.
Resources
Refusal in Language Models Is Mediated by a Single Direction - https://arxiv.org/abs/2406.11717
Uncensor any LLM with abliteration (great practical blog post by @mlabonne) - https://huggingface.co/blog/mlabonne/abliteration
Practical materials by @failspy
- abliterator library https://github.com/FailSpy/abliterator
- Llama-3-8B-Instruct-MopeyMule model (+ notebook): failspy/Llama-3-8B-Instruct-MopeyMule