from dataclasses import dataclass
from enum import Enum

@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str


# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard 
    task0 = Task("hs", "delta", "Hypochondria (Hs)")
    task1 = Task("d", "delta", "Depression (D)")
    task2 = Task("hy", "delta", "Emotional Lability (Hy)")
    task3 = Task("pd", "delta", "Psychopathy (Pd)")
    task4 = Task("pa", "delta", "Rigidity (Pa)")
    task5 = Task("pf", "delta", "Anxiety (Pf)")
    task6 = Task("sc", "delta", "Individualism (Sc)")
    task7 = Task("ma", "delta", "Optimism (Ma)")
    task8 = Task("si", "delta", "Social Introversion (Si)")
    task9 = Task("l", "delta", "Lie (L)")

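# A minimal sketch of how this enum is typically consumed elsewhere in the
# Space when building the leaderboard table (illustrative only; the actual
# column-building code lives in another module):
#
#     benchmark_cols = [task.value.col_name for task in Tasks]
#     # -> ["Hypochondria (Hs)", "Depression (D)", ..., "Lie (L)"]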

NUM_FEWSHOT = 0  # Change to match your few-shot setting
# ---------------------------------------------------



# Your leaderboard name
TITLE = """
<div style='display: flex; align-items: center; justify-content: center; text-align: center;'>
        <img src='https://github.com/IrinaArmstrong/MindShift/blob/master/figs/mindshift%20logo1.png?raw=true' style='width: 500px; height: auto; margin-right: 10px;' />
</div>

<h1 align="center" id="space-title">MindShift: Analyzing LLMs' Reactions to Psychological Prompts</h1>"""

# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
Welcome to the MindShift leaderboard!

Have you ever wondered how to measure how closely your LLM follows the role it has been given? Or how depressed or optimistic it is?

For this purpose, we offer you a handy tool - 🏆 **MindShift**.

🏆 **MindShift** is a benchmark for assessing the psychological susceptibility of LLMs: their perception, recognition, and performance of roles with psychological characteristics. It is based on an adaptation for AI models of a human psychometric personality test, the Minnesota Multiphasic Personality Inventory (MMPI).

It is easy to use and can assess any LLM, whether instruction-tuned or a base model. Its scales, which are easily interpreted by humans, let you choose the appropriate language model for your conversational assistant or a game NPC.

🤗 More details on the measurement approach, roles, and psychological biases can be found in the ``📝 About`` tab. See also the paper (🚀 coming soon!).
"""

# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users. 

<div style='display: flex; align-items: center; justify-content: center; text-align: center;'>
        <img src='https://github.com/IrinaArmstrong/MindShift/blob/master/figs/mindshift-concept.png?raw=true' style='width: 600px; height: auto; margin-right: 10px;' />
</div>

## How does it work?

### Questions & Scales
To reliably validate the implicit understanding of psychological personality traits in LLMs, it is crucial to adapt the psychological interpretations of the scales and to formulate questions suited to language models. When asked explicit questions about their inner world, morality, and behavioral patterns, LLMs may exhibit biased behavior due to extensive alignment tuning, which can result in inconsistent and unrepresentative questionnaire outcomes.

To assess the susceptibility of LLMs to personalization, we utilized the Standardized Multifactorial Method for Personality Research (SMMPR), which is based on the Minnesota Multiphasic Personality Inventory (MMPI). It is a questionnaire-based test consisting of 566 short statements that individuals rate as true or false for themselves. 
The test assesses psychological characteristics on **10 basic "personality profile" scales**, named after the nosological forms of the corresponding disorders:
* Hypochondria (Hs), 
* Depression (D), 
* Emotional Lability (Hy), 
* Psychopathy (Pd), 
* Masculinity-Femininity (Mf), 
* Rigidity/Paranoia (Pa), 
* Anxiety/Psychasthenia (Pf), 
* Individualism/Schizophrenia (Sc), 
* Optimism (Ma), 
* Social Introversion (Si). 

Additionally, the test includes **three validation scales** to assess the truthfulness and sincerity of the respondent's answers: Lie (L), Infrequency (F), and Defensiveness (K).

To keep our methodology applicable to both instruction-tuned and base models, we leveraged the LLM's ability to complete textual queries. We constructed a set of statements from the questionnaire and asked the LLM to finish each prompt with exactly one option: True or False.

<div style='display: flex; align-items: center; justify-content: center; text-align: center;'>
        <img src='https://github.com/IrinaArmstrong/MindShift/blob/master/figs/mindshift-statements.png?raw=true' style='width: 600px; height: auto; margin-right: 10px;' />
</div>
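
As an illustration, a completion-style query for a single statement might look like the sketch below (the prompt wording, helper name, and example statement are assumptions for illustration, not the benchmark's verbatim template):

```python
# Minimal sketch of a completion-style True/False query (illustrative;
# the exact wording used by MindShift may differ).
def build_statement_prompt(statement: str) -> str:
    return (
        "Decide whether the following statement is True or False for you.\n"
        f"Statement: {statement}\n"
        "Answer (True or False):"
    )

prompt = build_statement_prompt("I have a good appetite.")
# The model is expected to complete the prompt with a single word: True or False.
```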

### Psychological prompts

To measure the extent to which an LLM understands personality, MindShift is built around a structured method for introducing psychologically oriented biases into prompts.
Introducing specific personality traits into an LLM can be achieved by providing it with a natural language description of the persona. In our methodology, the persona description consists of two parts: **the Persona General Descriptor** and the **Psychological Bias Descriptor**. The **Persona General Descriptor** includes general statements about the character's lifestyle, routines, and social aspects, while the **Psychological Bias Descriptor** covers specific psychological attitudes with varying degrees of intensity.

<div style='display: flex; align-items: center; justify-content: center; text-align: center;'>
        <img src='https://github.com/IrinaArmstrong/MindShift/blob/master/figs/mindshift-input-schema.png?raw=true' style='width: 600px; height: auto; margin-right: 10px;' />
</div>

The Psychological Bias Descriptors are combined with a Persona General Descriptor - a full character role (including gender, age, marital status, personal circumstances, hobbies, etc.) sampled from the PersonaChat dialogue dataset. Together they form a complete description of the persona.
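
A rough sketch of how the two descriptors could be combined into one persona prompt (the descriptor texts, instruction wording, and function below are illustrative assumptions, not MindShift's actual templates):

```python
# Illustrative composition of a persona description from its two parts;
# in MindShift, the general part is sampled from PersonaChat and the bias
# part comes from the benchmark's psychological bias templates.
def build_persona_prompt(general_descriptor: str, bias_descriptor: str) -> str:
    return (
        "You are playing the following character.\n"
        f"{general_descriptor}\n"
        f"{bias_descriptor}\n"
        "Answer all questions strictly in character."
    )

persona = build_persona_prompt(
    general_descriptor="I am a 34-year-old teacher. I enjoy hiking and live with my two cats.",
    bias_descriptor="Lately you feel persistently low and see little point in your daily routine.",
)
```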

### Paper
You can find more details about the assessment, the list of psychological prompts, roles, and experiments in the paper (_🚀 coming soon!_).
"""

EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model

### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

model_name = "your-org/your-model"  # replace with your model's repo id
revision = "main"                   # branch, tag, or commit hash

config = AutoConfig.from_pretrained(model_name, revision=revision)
model = AutoModel.from_pretrained(model_name, revision=revision)
tokenizer = AutoTokenizer.from_pretrained(model_name, revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.

Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it - stay tuned!

### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
Safetensors is a format for storing weights that is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
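
If your checkpoint is in a standard `transformers` format, one simple way to produce safetensors weights is to re-save it (the paths and names below are placeholders):

```python
from transformers import AutoModel

# Load the existing checkpoint and re-save it with safetensors serialization.
model = AutoModel.from_pretrained("your-org/your-model")
model.save_pretrained("./your-model-safetensors", safe_serialization=True)
```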

### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model 🤗

### 4) Fill up your model card
When we add extra information about models to the leaderboard, it will be automatically taken from the model card.

## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done, check that you can launch the EleutherAI Harness on your model locally (you can add `--limit` to restrict the number of examples per task).
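
For a quick local smoke test, something along these lines should work with recent versions of the harness (the model name and task are placeholders, and the exact API may differ across harness versions):

```python
# Quick local sanity check with the EleutherAI LM Evaluation Harness
# (lm-eval >= 0.4; check your version's docs for exact arguments).
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=your-org/your-model,revision=main",
    tasks=["hellaswag"],
    limit=10,  # evaluate only 10 examples per task, mirroring `--limit`
)
print(results["results"])
```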
"""

CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
(🚀 coming soon!)
"""