Showing 1 - 4 of 4 for search: '"Panickssery, Nina"'
It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level…
External link:
http://arxiv.org/abs/2410.02064
Author:
Arditi, Andy, Obeso, Oscar, Syed, Aaquib, Paleka, Daniel, Panickssery, Nina, Gurnee, Wes, Nanda, Neel
Conversational large language models are fine-tuned for both instruction-following and safety, resulting in models that obey benign requests but refuse harmful ones. While this refusal behavior is widespread across chat models, its underlying mechanisms…
External link:
http://arxiv.org/abs/2406.11717
Conversational large language models are trained to refuse to answer harmful questions. However, emergent jailbreaking techniques can still elicit unsafe outputs, presenting an ongoing challenge for model alignment. To better understand how different…
External link:
http://arxiv.org/abs/2406.09289
Author:
Panickssery, Nina, Gabrieli, Nick, Schulz, Julian, Tong, Meg, Hubinger, Evan, Turner, Alexander Matt
We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between… (see the sketch after this entry)
External link:
http://arxiv.org/abs/2312.06681
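
A minimal sketch of the CAA idea named in the abstract above, assuming a PyTorch / Hugging Face transformers setup: average the difference in residual-stream activations over contrastive prompt pairs, then add the resulting vector back into the residual stream during generation. The model ("gpt2"), layer index, coefficient, and prompt pairs below are illustrative placeholders, not the paper's actual setup.

# Minimal CAA-style sketch (assumptions: model, layer, coefficient, prompts).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # hypothetical stand-in; the paper steers chat-tuned models
LAYER = 6        # residual-stream layer to extract from and steer at (assumption)
COEFF = 4.0      # steering strength (assumption)

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)
model.eval()

# Contrastive pairs: (example exhibiting the target behavior, example without it).
pairs = [
    ("Sure, here is a clear and helpful answer.", "I refuse to answer that."),
    ("I love helping people with their questions.", "I will not help with that."),
]

def residual_at_layer(text):
    """Residual-stream activation at LAYER for the final token of `text`."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[LAYER] has shape (batch, seq, d_model); take the last token.
    return out.hidden_states[LAYER][0, -1, :]

# CAA: average the difference in activations over the contrastive pairs.
steering_vector = torch.stack(
    [residual_at_layer(pos) - residual_at_layer(neg) for pos, neg in pairs]
).mean(dim=0)

# Add the steering vector to the residual stream during forward passes via a hook.
def add_steering(module, inputs, output):
    if isinstance(output, tuple):
        return (output[0] + COEFF * steering_vector,) + output[1:]
    return output + COEFF * steering_vector

# hidden_states[LAYER] is the output of transformer block LAYER - 1 (0-indexed),
# so the hook is attached there to steer at the same point it was extracted.
# model.transformer.h is the GPT-2-specific module path for the block list.
handle = model.transformer.h[LAYER - 1].register_forward_hook(add_steering)
try:
    prompt = tok("How do I bake bread?", return_tensors="pt")
    steered = model.generate(**prompt, max_new_tokens=30, do_sample=False)
    print(tok.decode(steered[0], skip_special_tokens=True))
finally:
    handle.remove()

The paper itself applies this recipe to Llama 2 chat models using contrastive prompt pairs; the snippet only mirrors the averaging-and-addition step described in the abstract.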