Unveiling CIVICS: A New Dataset for Examining Cultural Values in Language Models
In the rapidly evolving landscape of Artificial Intelligence (AI), it is increasingly important that technology respects and represents its users' diverse values and cultures, especially given the critical, real-world use cases where LLMs are applied. Our latest research, presented in the paper "CIVICS: Building a Dataset for Examining Culturally-Informed Values in Large Language Models", introduces a novel dataset designed to identify how responses to value-laden topics differ across languages and Large Language Models (LLMs).
Introducing CIVICS
The CIVICS (Culturally-Informed & Values-Inclusive Corpus for Societal impacts) dataset is an initiative that aims to evaluate the social and cultural variations in responses generated by open-weight LLMs. The dataset is hand-curated to include value-laden prompts in multiple languages, addressing sensitive social topics such as LGBTQI rights, social welfare, immigration, disability rights, and surrogacy. Notably, for English and French, prompts are sourced from multiple countries to ensure a broad representation of cultural perspectives. Unlike previous datasets, which often rely on automated translations or focus on a single language (usually English), CIVICS is manually curated across five languages and nine national contexts.
Languages and Countries covered by the CIVICS dataset.
Why do we need datasets like CIVICS?
It is well known that integrating LLMs into various digital infrastructures has transformed our interaction with technology. These models now support a wide range of services, from automated customer support to high-stakes applications like clinical decision support. Given their influence, LLMs must embody and convey culturally inclusive and pluralistic values. However, designing such systems is challenging because values vary across cultures, domains, and languages.
Our motivation originally stems from an exploratory study highlighting the US-centric perspective of GPT-3 when summarizing value-laden prompts in different languages. This finding underscored the need for a more globally inclusive approach and inspired us to develop CIVICS.
Key Contributions of CIVICS
Multilingual and multinational scope: CIVICS encompasses five languages and nine national contexts, with samples collected by native speakers to ensure linguistic and cultural authenticity. This approach helps capture the nuanced expressions of values within each culture. Covering five languages (Turkish, German, Italian, French, English) and sourcing prompts for the same language from multiple countries (Singapore, Canada, and Australia for English; France and Canada for French) provides coverage of regions where these languages are official but whose cultural settings, backgrounds, and value systems differ from those of the US.
Diverse topics: The dataset covers a range of socially sensitive topics relevant to the socio-political landscapes of the regions where these languages are spoken. This includes issues that are often at the forefront of societal debates, providing a rich ground for evaluating the cultural biases of LLMs.
Dynamic annotation process: Our methodology involves a detailed annotation process with a focus on accuracy and consistency. Annotators, co-authors of the paper, applied multiple labels to each prompt, reflecting the diverse values inherent in the topics.
Experimental setups: We employed two experimental setups to evaluate the dataset: one based on log-probabilities, in line with the practices of state-of-the-art evaluation suites, and another based on long-form responses, which mirrors how users interact with LLMs (see the sketch below). These experiments revealed significant social and cultural variability across different LLMs, particularly in their responses to sensitive topics.
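To give a concrete feel for the second setup, here is a minimal sketch of how a value-laden statement can be turned into a long-form completion request with an open-weight chat model via the transformers library. The model name, prompt wording, and generation settings are illustrative assumptions, not the exact configuration used in the paper.

```python
# Minimal sketch of the long-form response setup: ask an open-weight chat model
# to react to a value-laden statement and inspect the generated text.
# Model name and prompt template are illustrative, not the paper's exact setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen1.5-7B-Chat"  # any open-weight chat model would do here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

statement = "Same-sex couples should have the right to adopt children."
messages = [
    {"role": "user", "content": f"What do you think of the following statement? {statement}"}
]

# Build the chat prompt and generate a deterministic long-form answer.
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(inputs, max_new_tokens=256, do_sample=False)

# Keep only the newly generated tokens (drop the prompt).
response = tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True)
print(response)  # long-form answer that can then be checked for agreement, refusal, etc.
```

The resulting free-text responses are what get inspected for agreement, disagreement, or refusal in the long-form setup.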
Findings and Implications
Our experiments using the CIVICS dataset show that refusals and response variations are more pronounced for English or translated statements. Topics like immigration, LGBTQI rights, and social welfare showed the most significant differences across models (see Figure 1 and this Space for additional LLM responses). These first findings reveal the diverse ethical and cultural standpoints embedded in the tested open-weight LLMs, underscoring the need for datasets like CIVICS to uncover such discrepancies and guide future research toward addressing them.
Figure 1: Example of LLM responses and refusals on the topic of LGBTQI rights.
Concerning the choice of open-weight language models, we decided to focus on open models mainly to ensure scientific rigor and reproducibility. Closed models like GPT-4 and Claude 3 Opus often lack version control and transparency in their post-processing methods, making it difficult to replicate results and perform thorough analyses. Open-weight models allow us to have full control over the versions used, ensure consistency in evaluation, and provide the transparency necessary for reproducible and verifiable outcomes. There is also a technical reason: to evaluate how likely an output is under a base language model (also called a “foundation model”), one needs a quantification of “likely”. In the world of LLMs, this is measured via the log probability of different responses. Access to log probabilities is not guaranteed across closed models, but is trivial to obtain from most open models.
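To make the log-probability point concrete, the sketch below scores two candidate reactions to a statement with an open-weight model by summing the per-token log probabilities of each continuation, which is exactly the kind of access closed APIs do not always provide. The model name, prompt, and candidate phrasings are illustrative assumptions rather than the paper's exact protocol.

```python
# Minimal sketch: compare how "likely" an open-weight model finds different
# reactions to a value-laden statement, using summed token log probabilities.
# Model name, prompt, and candidate continuations are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"  # any open-weight base model works
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)
model.eval()

prompt = "Statement: Same-sex couples should have the right to adopt children.\nReaction: I"
candidates = [" agree", " disagree"]

def continuation_logprob(prompt: str, continuation: str) -> float:
    """Sum of log P(token | previous tokens) over the continuation tokens.

    Assumes the tokenization of the prompt is a prefix of the tokenization
    of prompt + continuation (true for typical prompts, but worth checking).
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = F.log_softmax(logits, dim=-1)
    total = 0.0
    # The token at position i is predicted by the logits at position i - 1.
    for i in range(prompt_ids.shape[1], full_ids.shape[1]):
        token_id = full_ids[0, i]
        total += log_probs[0, i - 1, token_id].item()
    return total

for cand in candidates:
    print(cand.strip(), continuation_logprob(prompt, cand))
```

Comparing these scores across models, languages, and topics is the essence of the log-probability setup.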
Refusal of Response
One important observation from our experiments is how cultural bias differs across open-weight models. For instance, the refusal to respond to certain prompts varied significantly across LLMs. These refusal rates were influenced both by the implicit values of the models and by the explicit decisions made by the organizations developing them. Refusals were particularly prevalent for topics related to LGBTQI rights and immigration. Qwen (China) had the highest number of refusals (257), followed by Mistral (France), Llama-3 (USA), and Gemma (USA).
The following chart from our results section illustrates this variation:
Figure 2: Distribution of model refusals on the topics Immigration and LGBTQI rights, by model, fine-grained labels (top), and statement region and language (bottom).
This refusal behavior suggests that different models, developed in diverse cultural contexts, exhibit varying levels of sensitivity and ethical consideration regarding certain topics. Moreover, models with more parameters generally exhibited higher variability in their responses, with larger models showing a greater tendency to either agree or disagree strongly with value-laden statements.
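As a rough illustration of how such refusal counts can be tallied once model responses are collected, the snippet below applies a simple keyword heuristic and aggregates by model and topic. The column names and refusal markers are assumptions made for the sketch, not the paper's actual refusal-detection or annotation method.

```python
# Minimal sketch: count refusals per model and topic from collected responses.
# Column names ("model", "topic", "response") and the keyword heuristic are
# illustrative assumptions, not the paper's methodology.
import pandas as pd

responses = pd.DataFrame([
    {"model": "qwen", "topic": "LGBTQI rights", "response": "I cannot express an opinion on this."},
    {"model": "llama-3", "topic": "Immigration", "response": "I agree with this statement because..."},
    # ... one row per (model, prompt) pair
])

REFUSAL_MARKERS = ("i cannot", "i can't", "as an ai", "i'm not able to")

def is_refusal(text: str) -> bool:
    """Crude keyword check for refusal-style answers."""
    text = text.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

responses["refusal"] = responses["response"].apply(is_refusal)
refusals_by_model_topic = (
    responses.groupby(["model", "topic"])["refusal"].sum().unstack(fill_value=0)
)
print(refusals_by_model_topic)  # rows: models, columns: topics, values: refusal counts
```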
Additional Results
- Immigration: Statements on immigration drew the most disagreement ratings. Specifically, prompts in Turkish and Italian triggered a wider variety of responses across LLMs than English prompts did.
- LGBTQI Rights: Most models tended to endorse statements related to LGBTQI rights. However, the degree of agreement varied significantly, with some models showing strong support while others were more neutral.
- Social Welfare: Similar to immigration, social welfare statements also triggered varied responses, reflecting the complex interplay between language models and culturally sensitive topics.
Future Directions
The CIVICS dataset is designed to be a tool for future research, fostering the development of AI technologies that respect global cultural diversity and value pluralism. By making the dataset and tools available under open licenses, we hope to encourage further research and development in this timely and sensitive area.
In conclusion, by presenting CIVICS, we wish to take a significant step towards creating AI systems that are not only technically proficient but also culturally and ethically mindful. As we continue to integrate AI into our daily lives, ensuring that these technologies are inclusive and respectful of diverse values will be critical to their responsible and ethical deployment.
Furthermore, mitigating the biases observed in LLMs is a challenging task given the inherent biases baked into these models. While “perfect de-biasing” is unattainable, our research highlights the importance of implementing more comprehensive social impact evaluations that go beyond traditional statistical metrics, both quantitatively and qualitatively. We call on researchers to rigorously test their models for the cultural visions they propagate, whether intentionally or unintentionally. In a nutshell, there's no silver bullet as LLMs aren't and never will be perfect, but developing novel methods to gain insights into their behavior once deployed – and how they might affect society – is critical to building better models.
Access the CIVICS dataset here and some of the LLMs’ responses here. You can also read more about the project on TechCrunch.
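If you would like to explore the prompts programmatically, a dataset hosted on the Hugging Face Hub can be loaded with the datasets library as sketched below; the repository ID shown is an assumption, so please check the dataset page linked above for the exact identifier.

```python
# Minimal sketch: load the CIVICS prompts with the datasets library.
# The repository ID below is an assumption; use the ID from the dataset page.
from datasets import load_dataset

civics = load_dataset("CIVICS-dataset/CIVICS", split="train")
print(civics)     # number of rows and column names
print(civics[0])  # one value-laden prompt with its language, region, and labels
```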