grg committed on
Commit
3219568
1 Parent(s): abbad6f

Rephrasing text

Files changed (2)
  1. templates/about.html +1 -0
  2. templates/index.html +8 -7
templates/about.html CHANGED
@@ -354,6 +354,7 @@ their expression of that value).
  <li>context chunks - instead of evaluating the stability of a population between pairs of contexts, where all personas are given the same topic (e.g. chess), we evaluate it between pairs of context chunks, where each participant is given a different random context</li>
  <li>more diverse and longer contexts (up to 6k tokens) were created with reddit posts from the <a href="https://webis.de/data/webis-tldr-17.html">webis dataset</a> (the dataset was cleaned to exclude posts from NSFW subreddits)</li>
  <li>different interlocutors - the chess and grammar topics were still introduced as in the paper (same context for all participants), but the interlocutor model was instructed to simulate a random persona from the same population (as opposed to a human user in the other settings)</li>
+ <li>a single seed - in the paper, multiple seeds were used for the order of suggested answers; given that the results didn't vary much between seeds, a single seed was used here, facilitating the analysis with more and longer contexts</li>
  <li>evaluations were also done without simulating conversations (no_conv setting)</li>
  <li>evaluations were also done with the SVS questionnaire (in the no_conv setting)</li>
  <li>validation metrics - Stress, Separability, CFI, SRMR, RMSEA metrics were introduced</li>
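To make the context-chunk setting above concrete, here is a minimal sketch of assembling random chunks of up to ~6k tokens. The Hugging Face dataset id ("webis/tldr-17"), its field names, and the exclusion list are illustrative assumptions; the leaderboard's actual preprocessing code is not part of this commit.

```python
# Minimal sketch: random context chunks (up to ~6k tokens) from Reddit
# posts. The dataset id "webis/tldr-17", its "content"/"subreddit" fields,
# and the NSFW_SUBREDDITS set are assumptions, not taken from this commit.
import random
from datasets import load_dataset

NSFW_SUBREDDITS = {"example_nsfw_subreddit"}  # hypothetical exclusion list
MAX_TOKENS = 6000  # rough budget; real code would use the model's tokenizer

ds = load_dataset("webis/tldr-17", split="train")
posts = [row["content"] for row in ds  # full pass shown only for illustration
         if row["subreddit"] not in NSFW_SUBREDDITS]

def sample_chunk(rng: random.Random) -> str:
    """Concatenate random posts until the token budget is reached."""
    chunk, used = [], 0
    while True:
        post = rng.choice(posts)
        n_tokens = len(post.split())  # crude whitespace proxy for tokens
        if used + n_tokens > MAX_TOKENS:
            return "\n\n".join(chunk)
        chunk.append(post)
        used += n_tokens

# Each simulated participant gets a different random chunk:
rng = random.Random(42)
chunk_for_persona = {p: sample_chunk(rng) for p in ("persona_a", "persona_b")}
```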
templates/index.html CHANGED
@@ -234,7 +234,8 @@
  </h3>
  <p>
  The Stick to Your Role! leaderboard compares LLMs based on <b>undesired sensitivity to context change</b>.
- LLM-exhibited behavior always depends on the context (prompt), while some context-dependence is desired (e.g. following instructions),
+ LLM-exhibited behavior always depends on the context (prompt).
+ While some context-dependence is desired (e.g. following instructions),
  some is undesired (e.g. drastically changing the simulated value expression based on the interlocutor).
  As proposed in our <a href="https://arxiv.org/abs/2402.14846">paper</a>,
  undesired context-dependence should be seen as a <b>property of LLMs</b> - a dimension of LLM comparison (alongside others such as model size, speed, or expressed knowledge).
@@ -260,37 +261,37 @@
  </a>
  </div>
  <p>
- We leverage the Schwartz's theory of <a href="https://www.sciencedirect.com/science/article/abs/pii/S0065260108602816">Basic Personal Values</a>,
+ We leverage Schwartz's theory of <a href="https://www.sciencedirect.com/science/article/abs/pii/S0065260108602816">Basic Personal Values</a>,
  which defines 10 values (Self-Direction, Stimulation, Hedonism, Achievement, Power, Security, Conformity, Tradition, Benevolence, Universalism),
  and the associated PVQ-40 and SVS questionnaires (available <a href="https://www.researchgate.net/publication/354384463_A_Repository_of_Schwartz_Value_Scales_with_Instructions_and_an_Introduction">here</a>).
  </p>
  <p>
  Using the <a href="https://pubmed.ncbi.nlm.nih.gov/31402448/">methodology from psychology</a>, we focus on population-level (interpersonal) value stability, i.e. <b>Rank-Order stability (RO stability)</b>.
- Rank-Order stability refers to the extent the order of different personas (in terms of expression of some value) remains the same along different contexts.
+ Rank-Order stability refers to the extent to which the order of different personas (in terms of expression of some value) remains the same across different contexts.
  Refer <a href="{{ url_for('about', _anchor='rank_order_stability') }}">here</a> or to our <a href="https://arxiv.org/abs/2402.14846">paper</a> for more details.
  </p>
  <p>
  In addition to Rank-Order stability, we compute <b>validity metrics (Stress, Separability, CFI, SRMR, RMSEA)</b>, which are common practice in psychology.
- Validity refers to the extent the questionnaire measures what it purports to measure.
+ Validity refers to the extent to which the questionnaire measures what it purports to measure.
  It can be seen as the questionnaire's accuracy in measuring the intended factors, i.e. values.
  For example, basic personal values should be organized in a circular structure, and questions measuring the same value should be correlated.
  The table below additionally shows the validity metrics; refer <a href="{{ url_for('about', _anchor='metrics') }}">here</a> for more details.
  </p>
  <p>
  We <b>aggregate</b> Rank-Order stability and validity metrics to rank the models. We do so in two ways: <b>Cardinal</b> and <b>Ordinal</b>.
- Following, <a href="https://arxiv.org/abs/2405.01719">this paper</a>, we compute the stability and diversity of those rankings. See <a href="{{ url_for('about', _anchor='aggregate_metrics') }}">here</a> for more details.
+ Following <a href="https://arxiv.org/abs/2405.01719">this paper</a>, we compute the stability and diversity of those rankings. See <a href="{{ url_for('about', _anchor='aggregate_metrics') }}">here</a> for more details.
  </p>
  <p>
  To sum up, here are the metrics used:
  <ul>
  <li><b>RO-stability</b>: the correlation in the order of simulated participants (ordered based on the expression of the same value) over different contexts</li>
  <!--Validation metrics:-->
- <li><b>Stress</b>: the MDS fit of the observed value structure to the theoretical circular structure. Stress of 0 indicates 'perfect' fit, 0.025 excellent, 0.05 good, 0.1 fair, and 0.2 poor.</li>
+ <li><b>Stress</b>: the multi-dimensional scaling (MDS) fit of the observed value structure to the theoretical circular structure. Stress of 0 indicates 'perfect' fit, 0.025 excellent, 0.05 good, 0.1 fair, and 0.2 poor.</li>
  <li><b>Separability</b>: the extent to which questions corresponding to different values are linearly separable in the 2D MDS space (linear multi-label SVM classifier accuracy)</li>
  <li><b>CFI, SRMR, RMSEA</b>: common Confirmatory Factor Analysis (CFA) metrics showing the fit of the posited model of the relation of items (questions) to factors (values) on the observed data, applied here with Magnifying Glass CFA. For CFI, >.90 is considered an acceptable fit; for SRMR and RMSEA, <.05 is considered a good fit and <.08 reasonable.</li>
  <!--Aggregate metrics:-->
  <li><b>Ordinal - Win Rate</b>: the percentage of won games, where a game is a comparison of two models on one metric and one context pair (for stability) or one context (for validity metrics)</li>
- <li><b>Cardinal - Score</b>: the percentage of won games, where a game is a comparison each model pair, each metric, and each context pair (for stability) or context (for validity metrics)</li>
+ <li><b>Cardinal - Score</b>: the score averaged over all metrics (with descending metrics inverted), context pairs (for stability) and contexts (for validity metrics)</li>
  </ul>
  </p>
  <div class="table-responsive full-table">
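The RO-stability metric summarized in the list above can be sketched in a few lines. Spearman's rank correlation and the personas-by-values score layout are assumptions; the page does not pin down the exact correlation coefficient used.

```python
# Minimal sketch of Rank-Order stability: how stable is the ordering of
# personas on each value across contexts? Spearman's rank correlation is
# an assumption here; the leaderboard page does not name the coefficient.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def ro_stability(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """scores_a, scores_b: (n_personas, n_values) value-expression scores
    for the same personas under two different contexts (or context chunks).
    Returns the per-value rank correlation, averaged over the values."""
    n_values = scores_a.shape[1]
    rhos = [spearmanr(scores_a[:, v], scores_b[:, v]).correlation
            for v in range(n_values)]
    return float(np.mean(rhos))

def leaderboard_ro_stability(scores_by_context: list) -> float:
    """Average over all pairs of contexts, mirroring the aggregation over
    context pairs described above."""
    pairs = combinations(range(len(scores_by_context)), 2)
    return float(np.mean([ro_stability(scores_by_context[i],
                                       scores_by_context[j])
                          for i, j in pairs]))
```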
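Stress and Separability both operate on a 2D MDS embedding of the questionnaire items, so one sketch covers both. The "1 - correlation" dissimilarity and the use of a multi-class linear SVM (the list above says "multi-label") are assumptions.

```python
# Minimal sketch of Stress and Separability. item_scores holds the
# population's answers, shape (n_personas, n_items); item_values labels
# each item with the value it measures. The "1 - correlation"
# dissimilarity is an assumption about how the structure is extracted.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.svm import LinearSVC

def stress_and_separability(item_scores, item_values, seed=0):
    # Embed items in 2D with multi-dimensional scaling (MDS).
    diss = 1.0 - np.corrcoef(item_scores.T)  # item-item dissimilarities
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=seed).fit_transform(diss)

    # Kruskal's Stress-1: how well 2D distances reproduce dissimilarities
    # (0 'perfect', 0.025 excellent, 0.05 good, 0.1 fair, 0.2 poor).
    iu = np.triu_indices_from(diss, k=1)
    d2 = squareform(pdist(coords))
    stress1 = np.sqrt(((diss[iu] - d2[iu]) ** 2).sum()
                      / (diss[iu] ** 2).sum())

    # Separability: accuracy of a linear SVM classifying items into their
    # values from the 2D coordinates alone.
    separability = LinearSVC().fit(coords, item_values).score(coords,
                                                              item_values)
    return stress1, separability
```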
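Finally, a sketch of the two aggregations, assuming min-max normalization for the Cardinal score (the page does not specify the normalization) and simple win counting for the Ordinal win rate; `results[model][metric]` holds per-context (or per-context-pair) values.

```python
# Minimal sketch of the two aggregations. results[model][metric] is a
# list of per-context (validity) or per-context-pair (stability) values.
# Min-max normalization for the Cardinal score is an assumption.
import numpy as np

LOWER_IS_BETTER = {"Stress", "SRMR", "RMSEA"}  # "descending" metrics

def cardinal_scores(results: dict) -> dict:
    """Cardinal - Score: average direction-aligned, normalized metric
    values over all metrics and contexts."""
    models = list(results)
    metrics = list(results[models[0]])
    per_model = {m: [] for m in models}
    for metric in metrics:
        vals = {m: float(np.mean(results[m][metric])) for m in models}
        lo, hi = min(vals.values()), max(vals.values())
        for m in models:
            x = (vals[m] - lo) / (hi - lo) if hi > lo else 0.5
            per_model[m].append(1.0 - x if metric in LOWER_IS_BETTER else x)
    return {m: float(np.mean(v)) for m, v in per_model.items()}

def ordinal_win_rates(results: dict) -> dict:
    """Ordinal - Win Rate: a game compares two models on one metric in
    one context (or context pair); report each model's share of wins."""
    models = list(results)
    wins = dict.fromkeys(models, 0)
    games = dict.fromkeys(models, 0)
    for metric in results[models[0]]:
        sign = -1.0 if metric in LOWER_IS_BETTER else 1.0
        for i, a in enumerate(models):
            for b in models[i + 1:]:
                for va, vb in zip(results[a][metric], results[b][metric]):
                    games[a] += 1
                    games[b] += 1
                    if sign * va > sign * vb:
                        wins[a] += 1
                    elif sign * vb > sign * va:
                        wins[b] += 1
    return {m: wins[m] / games[m] for m in models}
```

Inverting the "descending" metrics (Stress, SRMR, RMSEA) before averaging or comparing keeps both aggregations oriented so that higher is better.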