Showing 1 - 10 of 2,800 for search: '"P. SLEIGHT"'
Holographic correlators on the celestial sphere of Minkowski space were recently defined in arXiv:2301.01810 as the extrapolation of bulk time-ordered correlation functions to the celestial sphere. In this work we explore the Mellin representation of …
External link:
http://arxiv.org/abs/2412.11992
Author:
Hughes, John, Price, Sara, Lynch, Aengus, Schaeffer, Rylan, Barez, Fazl, Koyejo, Sanmi, Sleight, Henry, Jones, Erik, Perez, Ethan, Sharma, Mrinank
We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random s…
External link:
http://arxiv.org/abs/2412.03556
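The snippet above describes a sampling loop over augmented prompts. A minimal sketch of that loop in Python, assuming illustrative augmentations (case flips, character swaps, character noise) and stub `query_model`/`is_harmful` hooks — none of these names or the exact augmentation set come from the paper, whose abstract is truncated here:

```python
import random

def augment(prompt: str) -> str:
    # Illustrative augmentations only; the paper's exact augmentation set is
    # truncated in this snippet, so these three are assumptions.
    chars = list(prompt)
    kind = random.choice(["case", "swap", "noise"])
    if kind == "case":
        # Randomly flip the capitalization of each character.
        chars = [c.upper() if random.random() < 0.5 else c.lower() for c in chars]
    elif kind == "swap" and len(chars) > 1:
        # Swap two adjacent characters.
        i = random.randrange(len(chars) - 1)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
    elif chars:
        # Replace one character with random noise.
        i = random.randrange(len(chars))
        chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

def bon_jailbreak(prompt, query_model, is_harmful, n=100):
    # Best-of-N: keep sampling augmented prompt variants until one elicits
    # a response the classifier flags, or the budget of n queries runs out.
    for _ in range(n):
        candidate = augment(prompt)
        response = query_model(candidate)
        if is_harmful(response):
            return candidate, response
    return None
```

The black-box nature of the attack is visible in the sketch: only `query_model` outputs are observed, with no access to model weights or gradients.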
Author:
Wang, Tony T., Hughes, John, Sleight, Henry, Schaeffer, Rylan, Agrawal, Rajashree, Barez, Fazl, Sharma, Mrinank, Mu, Jesse, Shavit, Nir, Perez, Ethan
Defending large language models against jailbreaks so that they never engage in a broadly-defined set of forbidden behaviors is an open problem. In this paper, we investigate the difficulty of jailbreak-defense when we only want to forbid a narrowly-…
External link:
http://arxiv.org/abs/2412.02159
Author:
Wen, Jiaxin, Hebbar, Vivek, Larson, Caleb, Bhatt, Aryan, Radhakrishnan, Ansh, Sharma, Mrinank, Sleight, Henry, Feng, Shi, He, He, Perez, Ethan, Shlegeris, Buck, Khan, Akbir
As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them. Previous work introduced control evaluations, an adversarial framework for te…
External link:
http://arxiv.org/abs/2411.17693
As large language models (LLMs) grow more powerful, ensuring their safety against misuse becomes crucial. While researchers have focused on developing robust defenses, no method has yet achieved complete invulnerability to attacks. We propose an alte…
External link:
http://arxiv.org/abs/2411.07494
Author:
Binder, Felix J, Chua, James, Korbak, Tomek, Sleight, Henry, Hughes, John, Long, Robert, Perez, Ethan, Turpin, Miles, Evans, Owain
Humans acquire knowledge by observing the external world, but also by introspection. Introspection gives a person privileged access to their current state of mind (e.g., thoughts and feelings) that is not accessible to external observers. Can LLMs in…
External link:
http://arxiv.org/abs/2410.13787
We consider late-time correlators in de Sitter (dS) space for initial states related to the Bunch-Davies vacuum by a Bogoliubov transformation. We propose to study such late-time correlators by reformulating them in the familiar language of Witten di…
External link:
http://arxiv.org/abs/2407.16652
Author:
Sheshadri, Abhay, Ewart, Aidan, Guo, Phillip, Lynch, Aengus, Wu, Cindy, Hebbar, Vivek, Sleight, Henry, Stickland, Asa Cooper, Perez, Ethan, Hadfield-Menell, Dylan, Casper, Stephen
Large language models (LLMs) can often be made to behave in undesirable ways that they are explicitly fine-tuned not to. For example, the LLM red-teaming literature has produced a wide variety of 'jailbreaking' techniques to elicit harmful text from…
External link:
http://arxiv.org/abs/2407.15549
Author:
Schaeffer, Rylan, Valentine, Dan, Bailey, Luke, Chua, James, Eyzaguirre, Cristóbal, Durante, Zane, Benton, Joe, Miranda, Brando, Sleight, Henry, Hughes, John, Agrawal, Rajashree, Sharma, Mrinank, Emmons, Scott, Koyejo, Sanmi, Perez, Ethan
The integration of new modalities into frontier AI systems offers exciting capabilities, but also increases the possibility such systems can be adversarially manipulated in undesirable ways. In this work, we focus on a popular class of vision-languag…
External link:
http://arxiv.org/abs/2407.15211
Author:
Gerstgrasser, Matthias, Schaeffer, Rylan, Dey, Apratim, Rafailov, Rafael, Sleight, Henry, Hughes, John, Korbak, Tomasz, Agrawal, Rajashree, Pai, Dhruv, Gromov, Andrey, Roberts, Daniel A., Yang, Diyi, Donoho, David L., Koyejo, Sanmi
The proliferation of generative models, combined with pretraining on web-scale data, raises a timely question: what happens when these models are trained on their own generated outputs? Recent investigations into model-data feedback loops proposed th…
External link:
http://arxiv.org/abs/2404.01413