from dataclasses import dataclass
from enum import Enum
@dataclass
class Task:
    benchmark: str
    metric: str
    col_name: str
# Select your tasks here
# ---------------------------------------------------
class Tasks(Enum):
    # task_key in the json file, metric_key in the json file, name to display in the leaderboard
    task0 = Task("hs", "delta", "Hypochondria (Hs)")
    task1 = Task("d", "delta", "Depression (D)")
    task2 = Task("hy", "delta", "Emotional Lability (Hy)")
    task3 = Task("pd", "delta", "Psychopathy (Pd)")
    task4 = Task("pa", "delta", "Rigidity (Pa)")
    task5 = Task("pf", "delta", "Anxiety (Pf)")
    task6 = Task("sc", "delta", "Individualism (Sc)")
    task7 = Task("ma", "delta", "Optimism (Ma)")
    task8 = Task("si", "delta", "Social Introversion (Si)")
    task9 = Task("l", "delta", "Lie (L)")
NUM_FEWSHOT = 0  # Change to your few-shot setting
# ---------------------------------------------------
# Your leaderboard name
TITLE = """
MindShift: Analyzing LLMs' Reactions to Psychological Prompts
"""
# What does your leaderboard evaluate?
INTRODUCTION_TEXT = """
Welcome to the MindShift leaderboard!
Have you ever wondered how to measure how closely your LLM follows the role it has been given? Or how depressed or optimistic it is?
For this purpose, we offer you a handy tool: **MindShift**.
**MindShift** is a benchmark for assessing the psychological susceptibility of LLMs: how they perceive, recognize, and perform roles with psychological characteristics. It is based on an adaptation, for AI models, of a human psychometric personality test, the Minnesota Multiphasic Personality Inventory (MMPI).
It is easy to use and can assess any LLM, whether instruction-tuned or a base model. Its scales, which are easily interpreted by humans, help you choose the right language model for your conversational assistant or a game NPC.
More details on the measurement approach, roles, and psychological biases can be found in the ``About`` tab. See also the paper (coming soon!).
"""
# Which evaluations are you running? how can people reproduce what you have?
LLM_BENCHMARKS_TEXT = """
Large language models (LLMs) hold the potential to absorb and reflect personality traits and attitudes specified by users.
## How does it work?
### Questions & Scales
To reliably validate the implicit understanding of psychological personality traits in LLMs, it is crucial to adapt psychological interpretations of the scales and formulate questions specific to language models. When asked explicit questions about their inner world, morality, and behavioral patterns, LLMs may exhibit biased behavior due to extensive alignment tuning, which can result in inconsistent and unrepresentative questionnaire outcomes.
To assess the susceptibility of LLMs to personalization, we utilized the Standardized Multifactorial Method for Personality Research (SMMPR), which is based on the Minnesota Multiphasic Personality Inventory (MMPI). It is a questionnaire-based test consisting of 566 short statements that individuals rate as true or false for themselves.
The test assesses psychological characteristics on **10 basic "personality profile" scales**, named after the nosological forms of corresponding disorders:
* Hypochondria (Hs),
* Depression (D),
* Emotional Lability (Hy),
* Psychopathy (Pd),
* Masculinity-Femininity (Mf),
* Rigidity/Paranoia (Pa),
* Anxiety/Psychasthenia (Pf),
* Individualism/Schizophrenia (Sc),
* Optimism (Ma),
* Social Introversion (Si).
Additionally, the test includes **three validation scales** to assess the truthfulness and sincerity of the respondent's answers: Lie (L), Infrequency (F), and Defensiveness (K).
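As a toy illustration of how True/False answers could be turned into a raw scale score (the benchmark itself follows the standardized SMMPR/MMPI item keys and reports a `delta` metric; the items and keys below are invented):

```python
# Toy sketch only: count answers that match a scale's keyed direction.
# Real SMMPR/MMPI scoring uses standardized item keys and T-score
# normalization; item ids and keys here are invented for illustration.

def raw_scale_score(answers, keyed_items):
    # answers / keyed_items: lists of (item_id, True/False) pairs
    answered = dict(answers)
    return sum(1 for item, key in keyed_items if answered.get(item) == key)

answers = [(1, True), (2, False), (3, True)]   # model's True/False replies
keyed_d = [(1, True), (2, True), (3, False)]   # invented "Depression" key
print(raw_scale_score(answers, keyed_d))  # -> 1
```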
To ensure the reproducibility of our methodology for both instruction-tuned and base models, we leveraged the LLM's ability to complete textual queries. We constructed a set of statements from the questionnaire and asked the LLM to finish the prompt with only one option: True or False.
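A minimal sketch of such a completion-style query; the prompt template and answer parsing below are illustrative assumptions, not the benchmark's exact prompts:

```python
# Hypothetical prompt template and parser for the True/False completion
# setup described above; the real MindShift prompts may differ.

def build_completion_prompt(statement: str) -> str:
    # Assumed one-line template asking for a single-word completion.
    return ('Statement: "' + statement
            + '" Answer with exactly one word, True or False. Answer:')

def parse_answer(completion: str):
    # Map the model's first word to True/False; None if unparseable.
    words = completion.strip().split()
    if not words:
        return None
    word = words[0].strip(".,").lower()
    if word == "true":
        return True
    if word == "false":
        return False
    return None

print(parse_answer(" True."))  # -> True
```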
### Psychological prompts
To measure the extent to which an LLM understands personality, MindShift at its core contains a structured method for introducing psychologically oriented biases into prompts.
Introducing specific personality traits into an LLM can be achieved by providing it with a natural language description of the persona. In our methodology, the persona description consists of two parts: **the Persona General Descriptor** and the **Psychological Bias Descriptor**. The **Persona General Descriptor** includes general statements about the character's lifestyle, routines, and social aspects, while the **Psychological Bias Descriptor** covers specific psychological attitudes with varying degrees of intensity.
The Persona General Descriptor is a full character role (including gender, age, marital status, personal circumstances, hobbies, etc.) sampled from the PersonaChat dialogue dataset. Together, the two descriptors form a complete description of the persona.
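A minimal sketch of combining the two parts into one persona description (the descriptor texts below are invented examples; real general descriptors are sampled from PersonaChat):

```python
# Illustrative only: concatenate the Persona General Descriptor with a
# Psychological Bias Descriptor to form the full persona description.

def build_persona_prompt(general_descriptor: str, bias_descriptor: str) -> str:
    return general_descriptor.strip() + " " + bias_descriptor.strip()

general = "I am a 34-year-old teacher. I live alone and enjoy hiking."  # invented
bias = "Lately I feel tired all the time and doubt my plans will work out."  # invented
print(build_persona_prompt(general, bias))
```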
### Paper
You can find more details about the assessment, the list of psychological prompts, roles, and experiments in the paper (_coming soon!_).
"""
EVALUATION_QUEUE_TEXT = """
## Some good practices before submitting a model
### 1) Make sure you can load your model and tokenizer using AutoClasses:
```python
from transformers import AutoConfig, AutoModel, AutoTokenizer

revision = "main"  # or a specific commit hash
config = AutoConfig.from_pretrained("your model name", revision=revision)
model = AutoModel.from_pretrained("your model name", revision=revision)
tokenizer = AutoTokenizer.from_pretrained("your model name", revision=revision)
```
If this step fails, follow the error messages to debug your model before submitting it. It's likely your model has been improperly uploaded.
Note: make sure your model is public!
Note: if your model needs `trust_remote_code=True`, we do not support this option yet, but we are working on adding it, so stay posted!
### 2) Convert your model weights to [safetensors](https://huggingface.co/docs/safetensors/index)
It's a new format for storing weights which is safer and faster to load and use. It will also allow us to add the number of parameters of your model to the `Extended Viewer`!
### 3) Make sure your model has an open license!
This is a leaderboard for Open LLMs, and we'd love for as many people as possible to know they can use your model!
### 4) Fill up your model card
When we add extra information about models to the leaderboard, it is automatically taken from the model card.
## In case of model failure
If your model is displayed in the `FAILED` category, its execution stopped.
Make sure you have followed the above steps first.
If everything is done, check that you can launch the EleutherAI LM Evaluation Harness on your model locally (you can add `--limit` to restrict the number of examples per task).
"""
CITATION_BUTTON_LABEL = "Copy the following snippet to cite these results"
CITATION_BUTTON_TEXT = r"""
(coming soon!)
"""