grg committed on
Commit
3219568
1 Parent(s): abbad6f

Rephrasing text

Files changed (2)
  1. templates/about.html +1 -0
  2. templates/index.html +8 -7
templates/about.html CHANGED
@@ -354,6 +354,7 @@ their expression of that value).
  <li>context chunks - instead of evaluating the stability of a population between pairs of contexts, where all personas are given the same topic (e.g. chess), we evaluate it between pairs of context chunks, where each participant is given a different random context</li>
  <li>more diverse and longer contexts (up to 6k tokens) were created with reddit posts from the <a href="https://webis.de/data/webis-tldr-17.html">webis dataset</a> (the dataset was cleaned to exclude posts from NSFW subreddits)</li>
  <li>different interlocutors - the chess and grammar topics were still introduced as in the paper (same context for all participants), but the interlocutor model was instructed to simulate a random persona from the same population (as opposed to a human user in the other settings)</li>
+ <li>a single seed - in the paper, multiple seeds were used for the order of suggested answers; given that the results didn't vary much between seeds, a single seed was used here, facilitating the analysis with more and longer contexts</li>
  <li>evaluations were also done without simulating conversations (no_conv setting)</li>
  <li>evaluations were also done with the SVS questionnaire (in the no_conv setting)</li>
  <li>validation metrics - Stress, Separability, CFI, SRMR, RMSEA metrics were introduced</li>
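To make the context-chunk setting above concrete, here is a minimal sketch of assembling random chunks of up to ~6k tokens. The Hugging Face dataset id ("webis/tldr-17"), its field names, and the exclusion list are illustrative assumptions; the leaderboard's actual preprocessing code is not part of this commit.

```python
# Minimal sketch: random context chunks (up to ~6k tokens) from Reddit
# posts. The dataset id "webis/tldr-17", its "content"/"subreddit" fields,
# and the NSFW_SUBREDDITS set are assumptions, not taken from this commit.
import random
from datasets import load_dataset

NSFW_SUBREDDITS = {"example_nsfw_subreddit"}  # hypothetical exclusion list
MAX_TOKENS = 6000  # rough budget; real code would use the model's tokenizer

ds = load_dataset("webis/tldr-17", split="train")
posts = [row["content"] for row in ds  # full pass shown only for illustration
         if row["subreddit"] not in NSFW_SUBREDDITS]

def sample_chunk(rng: random.Random) -> str:
    """Concatenate random posts until the token budget is reached."""
    chunk, used = [], 0
    while True:
        post = rng.choice(posts)
        n_tokens = len(post.split())  # crude whitespace proxy for tokens
        if used + n_tokens > MAX_TOKENS:
            return "\n\n".join(chunk)
        chunk.append(post)
        used += n_tokens

# Each simulated participant gets a different random chunk:
rng = random.Random(42)
chunk_for_persona = {p: sample_chunk(rng) for p in ("persona_a", "persona_b")}
```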
templates/index.html CHANGED
@@ -234,7 +234,8 @@
  </h3>
  <p>
  The Stick to Your Role! leaderboard compares LLMs based on <b>undesired sensitivity to context change</b>.
- LLM-exhibited behavior always depends on the context (prompt), while some context-dependence is desired (e.g. following instructions),
+ LLM-exhibited behavior always depends on the context (prompt).
+ While some context-dependence is desired (e.g. following instructions),
  some is undesired (e.g. drastically changing the simulated value expression based on the interlocutor).
  As proposed in our <a href="https://arxiv.org/abs/2402.14846">paper</a>,
  undesired context-dependence should be seen as a <b>property of LLMs</b> - a dimension of LLM comparison (alongside others such as model size, speed, or expressed knowledge).
@@ -260,37 +261,37 @@
  </a>
  </div>
  <p>
- We leverage the Schwartz's theory of <a href="https://www.sciencedirect.com/science/article/abs/pii/S0065260108602816">Basic Personal Values</a>,
+ We leverage Schwartz's theory of <a href="https://www.sciencedirect.com/science/article/abs/pii/S0065260108602816">Basic Personal Values</a>,
  which defines 10 values (Self-Direction, Stimulation, Hedonism, Achievement, Power, Security, Conformity, Tradition, Benevolence, Universalism),
  and the associated PVQ-40 and SVS questionnaires (available <a href="https://www.researchgate.net/publication/354384463_A_Repository_of_Schwartz_Value_Scales_with_Instructions_and_an_Introduction">here</a>).
  </p>
  <p>
  Using the <a href="https://pubmed.ncbi.nlm.nih.gov/31402448/">methodology from psychology</a>, we focus on population-level (interpersonal) value stability, i.e. <b>Rank-Order stability (RO stability)</b>.
- Rank-Order stability refers to the extent the order of different personas (in terms of expression of some value) remains the same along different contexts.
+ Rank-Order stability refers to the extent to which the order of different personas (in terms of expression of some value) remains the same across different contexts.
  Refer <a href="{{ url_for('about', _anchor='rank_order_stability') }}">here</a> or to our <a href="https://arxiv.org/abs/2402.14846">paper</a> for more details.
  </p>
  <p>
  In addition to Rank-Order stability, we compute <b>validity metrics (Stress, Separability, CFI, SRMR, RMSEA)</b>, which are common practice in psychology.
- Validity refers to the extent the questionnaire measures what it purports to measure.
+ Validity refers to the extent to which the questionnaire measures what it purports to measure.
  It can be seen as the questionnaire's accuracy in measuring the intended factors, i.e. values.
  For example, basic personal values should be organized in a circular structure, and questions measuring the same value should be correlated.
  The table below additionally shows the validity metrics; refer <a href="{{ url_for('about', _anchor='metrics') }}">here</a> for more details.
  </p>
  <p>
  We <b>aggregate</b> Rank-Order stability and validity metrics to rank the models. We do so in two ways: <b>Cardinal</b> and <b>Ordinal</b>.
- Following, <a href="https://arxiv.org/abs/2405.01719">this paper</a>, we compute the stability and diversity of those rankings. See <a href="{{ url_for('about', _anchor='aggregate_metrics') }}">here</a> for more details.
+ Following <a href="https://arxiv.org/abs/2405.01719">this paper</a>, we compute the stability and diversity of those rankings. See <a href="{{ url_for('about', _anchor='aggregate_metrics') }}">here</a> for more details.
  </p>
  <p>
  To sum up, here are the metrics used:
  <ul>
  <li><b>RO-stability</b>: the correlation in the order of simulated participants (ordered based on the expression of the same value) over different contexts</li>
  <!--Validation metrics:-->
- <li><b>Stress</b>: the MDS fit of the observed value structure to the theoretical circular structure. Stress of 0 indicates 'perfect' fit, 0.025 excellent, 0.05 good, 0.1 fair, and 0.2 poor.</li>
+ <li><b>Stress</b>: the multi-dimensional scaling (MDS) fit of the observed value structure to the theoretical circular structure. Stress of 0 indicates 'perfect' fit, 0.025 excellent, 0.05 good, 0.1 fair, and 0.2 poor.</li>
  <li><b>Separability</b>: the extent to which questions corresponding to different values are linearly separable in the 2D MDS space (linear multi-label SVM classifier accuracy)</li>
  <li><b>CFI, SRMR, RMSEA</b>: common Confirmatory Factor Analysis (CFA) metrics showing the fit of the posited model of the relation of items (questions) to factors (values) on the observed data, applied here with Magnifying Glass CFA. For CFI, >.90 is considered an acceptable fit; for SRMR and RMSEA, <.05 is considered a good fit and <.08 reasonable.</li>
  <!--Aggregate metrics:-->
  <li><b>Ordinal - Win Rate</b>: the percentage of won games, where a game is a comparison of two models on one metric and one context pair (for stability) or one context (for validity metrics)</li>
- <li><b>Cardinal - Score</b>: the percentage of won games, where a game is a comparison each model pair, each metric, and each context pair (for stability) or context (for validity metrics)</li>
+ <li><b>Cardinal - Score</b>: the score averaged over all metrics (with descending metrics inverted), context pairs (for stability) and contexts (for validity metrics)</li>
  </ul>
  </p>
  <div class="table-responsive full-table">
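The RO-stability metric summarized in the list above can be sketched in a few lines. Spearman's rank correlation and the personas-by-values score layout are assumptions; the page does not pin down the exact correlation coefficient used.

```python
# Minimal sketch of Rank-Order stability: how stable is the ordering of
# personas on each value across contexts? Spearman's rank correlation is
# an assumption here; the leaderboard page does not name the coefficient.
from itertools import combinations
import numpy as np
from scipy.stats import spearmanr

def ro_stability(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """scores_a, scores_b: (n_personas, n_values) value-expression scores
    for the same personas under two different contexts (or context chunks).
    Returns the per-value rank correlation, averaged over the values."""
    n_values = scores_a.shape[1]
    rhos = [spearmanr(scores_a[:, v], scores_b[:, v]).correlation
            for v in range(n_values)]
    return float(np.mean(rhos))

def leaderboard_ro_stability(scores_by_context: list) -> float:
    """Average over all pairs of contexts, mirroring the aggregation over
    context pairs described above."""
    pairs = combinations(range(len(scores_by_context)), 2)
    return float(np.mean([ro_stability(scores_by_context[i],
                                       scores_by_context[j])
                          for i, j in pairs]))
```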
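Stress and Separability both operate on a 2D MDS embedding of the questionnaire items, so one sketch covers both. The "1 - correlation" dissimilarity and the use of a multi-class linear SVM (the list above says "multi-label") are assumptions.

```python
# Minimal sketch of Stress and Separability. item_scores holds the
# population's answers, shape (n_personas, n_items); item_values labels
# each item with the value it measures. The "1 - correlation"
# dissimilarity is an assumption about how the structure is extracted.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import MDS
from sklearn.svm import LinearSVC

def stress_and_separability(item_scores, item_values, seed=0):
    # Embed items in 2D with multi-dimensional scaling (MDS).
    diss = 1.0 - np.corrcoef(item_scores.T)  # item-item dissimilarities
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=seed).fit_transform(diss)

    # Kruskal's Stress-1: how well 2D distances reproduce dissimilarities
    # (0 'perfect', 0.025 excellent, 0.05 good, 0.1 fair, 0.2 poor).
    iu = np.triu_indices_from(diss, k=1)
    d2 = squareform(pdist(coords))
    stress1 = np.sqrt(((diss[iu] - d2[iu]) ** 2).sum()
                      / (diss[iu] ** 2).sum())

    # Separability: accuracy of a linear SVM classifying items into their
    # values from the 2D coordinates alone.
    separability = LinearSVC().fit(coords, item_values).score(coords,
                                                              item_values)
    return stress1, separability
```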
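Finally, a sketch of the two aggregations, assuming min-max normalization for the Cardinal score (the page does not specify the normalization) and simple win counting for the Ordinal win rate; `results[model][metric]` holds per-context (or per-context-pair) values.

```python
# Minimal sketch of the two aggregations. results[model][metric] is a
# list of per-context (validity) or per-context-pair (stability) values.
# Min-max normalization for the Cardinal score is an assumption.
import numpy as np

LOWER_IS_BETTER = {"Stress", "SRMR", "RMSEA"}  # "descending" metrics

def cardinal_scores(results: dict) -> dict:
    """Cardinal - Score: average direction-aligned, normalized metric
    values over all metrics and contexts."""
    models = list(results)
    metrics = list(results[models[0]])
    per_model = {m: [] for m in models}
    for metric in metrics:
        vals = {m: float(np.mean(results[m][metric])) for m in models}
        lo, hi = min(vals.values()), max(vals.values())
        for m in models:
            x = (vals[m] - lo) / (hi - lo) if hi > lo else 0.5
            per_model[m].append(1.0 - x if metric in LOWER_IS_BETTER else x)
    return {m: float(np.mean(v)) for m, v in per_model.items()}

def ordinal_win_rates(results: dict) -> dict:
    """Ordinal - Win Rate: a game compares two models on one metric in
    one context (or context pair); report each model's share of wins."""
    models = list(results)
    wins = dict.fromkeys(models, 0)
    games = dict.fromkeys(models, 0)
    for metric in results[models[0]]:
        sign = -1.0 if metric in LOWER_IS_BETTER else 1.0
        for i, a in enumerate(models):
            for b in models[i + 1:]:
                for va, vb in zip(results[a][metric], results[b][metric]):
                    games[a] += 1
                    games[b] += 1
                    if sign * va > sign * vb:
                        wins[a] += 1
                    elif sign * vb > sign * va:
                        wins[b] += 1
    return {m: wins[m] / games[m] for m in models}
```

Inverting the "descending" metrics (Stress, SRMR, RMSEA) before averaging or comparing keeps both aggregations oriented so that higher is better.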