SnakyMcSnekFace committed
Commit f327f9c • Parent: 9ef4b89
Upload README.md

README.md CHANGED
```diff
@@ -174,6 +174,8 @@ Half of the samples was generated by this model where prompts contained the adve
 
 [KTO](https://arxiv.org/abs/2402.01306) trainer from [Hugging Face TRL library](https://huggingface.co/docs/trl/en/kto_trainer) was employed for performing preference alignment. The LoRA adapter from the previous training stages was merged into the model, and a new LoRA adapter was created for the KTO training. The quantized base model serves as a reference.
 
+During the alignment, the model was encouraged to respect player's actions and agency, construct a coherent narrative, and use evocative language to describe the world and the outcome of the player's actions.
+
 #### QLoRa adapter configuration
 
 - Rank: 16
@@ -210,7 +212,7 @@ The model's performance in Adventure Mode has improved substantially. The writin
 ![Gradient Norm](img/kto_grad_norm.png)
 ![Learning rate](img/kto_learning_rate.png)
 ![Rewards](img/kto_train_rewards.png)
-![Log probabilities](img/
+![Log probabilities](img/kto_train_logps.png)
 ![KL divergence](img/kto_train_kl_divergence.png)
 
 
```
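For context on the KTO setup the diff describes: unlike DPO, TRL's `KTOTrainer` consumes *unpaired* examples, where each completion carries its own boolean desirability label rather than belonging to a chosen/rejected pair. A minimal sketch of assembling records in that shape — the field names follow TRL's documented unpaired preference format, but the helper function and sample texts are illustrative, not from this commit:

```python
def build_kto_records(prompt, desirable, undesirable):
    """Assemble unpaired KTO examples: each completion is labeled
    True (desirable) or False (undesirable) on its own, with no
    pairing between good and bad completions for a prompt."""
    records = []
    for completion in desirable:
        records.append({"prompt": prompt, "completion": completion, "label": True})
    for completion in undesirable:
        records.append({"prompt": prompt, "completion": completion, "label": False})
    return records

# Hypothetical Adventure Mode examples: the desirable completion
# advances the narrative; the undesirable one breaks character.
records = build_kto_records(
    "> open the door",
    desirable=["The door creaks open onto a torchlit hall."],
    undesirable=["As an AI, I cannot open doors."],
)
print(len(records))  # one record per labeled completion
```

A dataset of such records (e.g. via `datasets.Dataset.from_list(records)`) is what gets handed to the trainer, with the merged, quantized base model serving as the frozen reference for the KTO loss.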