arxiv:2406.11614

Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Published on Jun 17 · Submitted by YihuaiHong on Jun 20

Authors: Yihuai Hong, Shauli Ravfogel, Mor Geva

Abstract

The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance for mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general methodology for eliciting directions in the parameter space (termed "concept vectors") that encode concrete concepts, and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors, while directly ablating these vectors demonstrably removes the associated knowledge from the LLMs and significantly reduces their susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parametric-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
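The contrast the abstract draws, between unlearning that barely moves a concept vector and direct ablation of that vector, can be illustrated with a toy sketch. This is not the authors' code: the vectors are random stand-ins rather than directions extracted from an LLM, the function names (`trace_change`, `ablate_with_noise`) are hypothetical, and ablating by adding Gaussian noise, like the cosine and L2 metrics, is an assumption made only for illustration.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def l2_distance(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def trace_change(before, after):
    """Hypothetical intrinsic score: how far a concept vector moved after unlearning."""
    return {
        "cosine": cosine_similarity(before, after),
        "l2": l2_distance(before, after),
    }


def ablate_with_noise(vector, noise_std, rng):
    """Assumed ablation for this sketch: perturb the vector with Gaussian noise."""
    return [x + rng.gauss(0.0, noise_std) for x in vector]


rng = random.Random(0)

# Stand-in for a concept vector; a real one would be read from model parameters.
concept_vector = [rng.gauss(0.0, 1.0) for _ in range(64)]

# Simulates an unlearning method that changes behavior but barely edits the vector.
lightly_edited = [x + rng.gauss(0.0, 0.01) for x in concept_vector]

# Simulates directly ablating the vector.
ablated = ablate_with_noise(concept_vector, noise_std=5.0, rng=rng)

light = trace_change(concept_vector, lightly_edited)
strong = trace_change(concept_vector, ablated)

print("after light edit:", light)
print("after ablation:  ", strong)
```

In this toy setup the lightly edited vector stays almost parallel to the original, while the ablated one does not. A behavioral test alone would not reveal that difference, which is the gap a parametric evaluation is meant to expose.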

Community

YihuaiHong (paper author, paper submitter) · Jun 20 · edited Jun 20

🚀 The first-ever parametric LLM unlearning benchmark!

We find that current unlearning methods modify only the model's behavior, without truly erasing the knowledge encoded in its parameters. To address this, we present the ConceptVectors benchmark, in which each vector is strongly tied to a specific concept.

This is the ConceptVectors benchmark for the paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces".

Paper: https://arxiv.org/pdf/2406.11614

Website: https://yihuaihong.github.io/ConceptVectors.github.io

GitHub: https://github.com/yihuaihong/ConceptVectors