Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Abstract
The task of "unlearning" certain concepts in large language models (LLMs) has recently attracted considerable attention, due to its importance for mitigating undesirable model behaviors, such as the generation of harmful, private, or incorrect information. Current protocols for evaluating unlearning methods largely rely on behavioral tests, without monitoring whether the unlearned knowledge is still present in the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavior-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
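The idea of evaluating unlearning "internally" can be illustrated with a toy sketch. The code below is not the paper's implementation: it assumes a concept vector is simply a list of floats, uses cosine similarity and L2 distance as hypothetical measures of parametric change, and assumes Gaussian noise as one possible way to ablate a vector (the abstract only says the vectors are "directly ablated"). All values are synthetic random numbers, not benchmark data.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def ablate_with_noise(vector, noise_scale, rng):
    """Assumed ablation: add zero-mean Gaussian noise to every coordinate."""
    return [x + rng.gauss(0.0, noise_scale) for x in vector]


rng = random.Random(0)

# Synthetic stand-in for a concept vector taken from a model's parameters.
concept_vector = [rng.gauss(0.0, 1.0) for _ in range(64)]

# Hypothetical outcome of an unlearning method that barely moves the vector.
after_unlearning = [x + rng.gauss(0.0, 0.01) for x in concept_vector]

# Hypothetical outcome of directly ablating the vector with strong noise.
after_ablation = ablate_with_noise(concept_vector, noise_scale=5.0, rng=rng)

sim_unlearning = cosine_similarity(concept_vector, after_unlearning)
sim_ablation = cosine_similarity(concept_vector, after_ablation)
dist_unlearning = l2_distance(concept_vector, after_unlearning)
dist_ablation = l2_distance(concept_vector, after_ablation)

print(f"unlearning: cos={sim_unlearning:.4f}, l2={dist_unlearning:.4f}")
print(f"ablation:   cos={sim_ablation:.4f}, l2={dist_ablation:.4f}")
```

In this sketch, a similarity close to 1 after unlearning would indicate that the parametric trace is largely intact even if the model's outputs have changed, which is the kind of gap a purely behavioral test cannot reveal.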
Community
🚀 The first-ever parametric LLM unlearning benchmark!
We find that current unlearning methods only modify the model's behavior without truly erasing the knowledge encoded in its parameters. To address this, we present the ConceptVectors benchmark, in which each vector is strongly tied to a specific concept.
The ConceptVectors Benchmark for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces".
Paper: https://arxiv.org/pdf/2406.11614
Website: https://yihuaihong.github.io/ConceptVectors.github.io