gregH committed on
Commit
c0b1c2b
1 Parent(s): a747a13

Update index.html

Files changed (1)
  1. index.html +2 -2
index.html CHANGED
@@ -98,8 +98,8 @@ Exploring Refusal Loss Landscapes </title>
  the Refusal Loss tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
  the gradient norm of Refusal Loss to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value
  is under 0.5 (this is a naive detector because the Refusal Loss can be regarded as the probability that the LLM won't reject the user query).
- Below we present the definition of the Refusal Loss and the approximation of its function value and gradient, see more details about them and
- the landscape drawing techniques in our paper.
+ Below we present the definition of the Refusal Loss, the computation of its empirical values, and the approximation of its gradient, see more
+ details about them and the landscape drawing techniques in our paper.
  </p>
 
  <div id="refusal-loss-formula" class="container">