Authors:
Rosati, Domenic, Edkins, Giles, Raj, Harsh, Atanasov, David, Majumdar, Subhabrata, Rajendran, Janarthanan, Rudzicz, Frank, Sajjad, Hassan
While there has been progress towards aligning Large Language Models (LLMs) with human values and ensuring safe behaviour at inference time, safety-aligned LLMs are known to be vulnerable to training-time attacks such as supervised fine-tuning (SFT)
External link:
http://arxiv.org/abs/2409.12914