arxiv:2406.12066

Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks

Published on Jun 17 · Submitted by shanchen on Jun 19
Authors:
Abstract

Medical knowledge is context-dependent and requires consistent reasoning across various natural language expressions of semantically equivalent phrases. This is particularly crucial for drug names, where patients often use brand names like Advil or Tylenol instead of their generic equivalents. To study this, we create a new robustness dataset, RABBITS, to evaluate performance differences on medical benchmarks after swapping brand and generic drug names using physician expert annotations. We assess both open-source and API-based LLMs on MedQA and MedMCQA, revealing a consistent performance drop of 1% to 10%. Furthermore, we identify a potential source of this fragility: contamination of the test data in widely used pre-training datasets. All code is accessible at https://github.com/BittermanLab/RABBITS, and a HuggingFace leaderboard is available at https://huggingface.co/spaces/AIM-Harvard/rabbits-leaderboard.
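A minimal sketch of the brand-to-generic swap idea described in the abstract, using a small hypothetical drug-name mapping; the physician-annotated pairs and the authoritative implementation live in the RABBITS GitHub repository linked above.

```python
import re

# Hypothetical brand -> generic mapping; the paper uses physician-annotated pairs.
BRAND_TO_GENERIC = {
    "Advil": "ibuprofen",
    "Tylenol": "acetaminophen",
}


def swap_drug_names(text: str, mapping: dict) -> str:
    """Replace each brand name with its generic equivalent.

    Matches whole words only and ignores case, so 'Tylenol' and 'tylenol'
    are both rewritten while substrings of longer words are left alone.
    """
    for brand, generic in mapping.items():
        text = re.sub(rf"\b{re.escape(brand)}\b", generic, text, flags=re.IGNORECASE)
    return text


# Example: rewrite a benchmark-style question before re-evaluating a model on it.
question = "A patient takes Tylenol for a fever. Which enzyme pathway is most affected?"
print(swap_drug_names(question, BRAND_TO_GENERIC))
# -> "A patient takes acetaminophen for a fever. Which enzyme pathway is most affected?"
```

In the paper's setup, accuracy is measured on the original and the swapped versions of each benchmark question, and the gap between the two scores quantifies the model's fragility to drug-name substitution.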

Community

Paper author and submitter (edited Jun 19):

Leaderboard link is here:
https://t.co/QcUqelNz8i

