Robert Kirk

robkirk

https://robertkirk.github.io/

AI & ML interests

AI Alignment and Safety, Generalisation, RLHF, LLMs

robkirk's activity

upvoted 2 papers 10 months ago

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

Paper • 2401.05566 • Published Jan 10, 2024 • 26

Mechanistically analyzing the effects of fine-tuning on procedurally defined tasks

Paper • 2311.12786 • Published Nov 21, 2023 • 2

upvoted a paper about 1 year ago

Understanding the Effects of RLHF on LLM Generalisation and Diversity

Paper • 2310.06452 • Published Oct 10, 2023 • 2