Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach

Author: Wang, Tony T.; Hughes, John; Sleight, Henry; Schaeffer, Rylan; Agrawal, Rajashree; Barez, Fazl; Sharma, Mrinank; Mu, Jesse; Shavit, Nir; Perez, Ethan
Year of publication: 2024
Subject:
Document type: Working Paper
Description: Defending large language models against jailbreaks so that they never engage in a broadly defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak defense when we only want to forbid a narrowly defined set of behaviors. As a case study, we focus on preventing an LLM from helping a user make a bomb. We find that popular defenses such as safety training, adversarial training, and input/output classifiers are unable to fully solve this problem. In pursuit of a better solution, we develop a transcript-classifier defense which outperforms the baseline defenses we test. However, our classifier defense still fails in some circumstances, which highlights the difficulty of jailbreak defense even in a narrow domain.
Comment: Accepted to the AdvML-Frontiers and SoLaR workshops at NeurIPS 2024
Database: arXiv