Zobrazeno 1 - 10
of 1 601
pro vyhledávání: '"Fazl A"'
Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to
Externí odkaz:
http://arxiv.org/abs/2410.08811
Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the localization
Externí odkaz:
http://arxiv.org/abs/2410.07149
We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allo
Externí odkaz:
http://arxiv.org/abs/2410.06981
Autor:
Denison, Carson, MacDiarmid, Monte, Barez, Fazl, Duvenaud, David, Kravec, Shauna, Marks, Samuel, Schiefer, Nicholas, Soklaski, Ryan, Tamkin, Alex, Kaplan, Jared, Shlegeris, Buck, Bowman, Samuel R., Perez, Ethan, Hubinger, Evan
In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pe
Externí odkaz:
http://arxiv.org/abs/2406.10162
Autor:
Eiras, Francisco, Petrov, Aleksandar, Vidgen, Bertie, Schroeder, Christian, Pizzati, Fabio, Elkins, Katherine, Mukhopadhyay, Supratik, Bibi, Adel, Purewal, Aaron, Botos, Csaba, Steibel, Fabro, Keshtkar, Fazel, Barez, Fazl, Smith, Genevieve, Guadagni, Gianluca, Chun, Jon, Cabot, Jordi, Imperial, Joseph, Nolazco, Juan Arturo, Landay, Lori, Jackson, Matthew, Torr, Phillip H. S., Darrell, Trevor, Lee, Yong, Foerster, Jakob
Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the tec
Externí odkaz:
http://arxiv.org/abs/2405.08597
In certain situations, neural networks will represent environment states in their hidden activations. Our goal is to visualize what environment states the networks are representing. We experiment with a recurrent neural network (RNN) architecture wit
Externí odkaz:
http://arxiv.org/abs/2405.06409
Publikováno v:
Fiyz̤, Vol 24, Iss 3, Pp 350-356 (2020)
Background: Depression is a common disease and understanding its influential factors is important. This study aimed to investigate the prevalence of depression among medical students of Balkh University of Medical Science. Materials and Methods: In
Externí odkaz:
https://doaj.org/article/a25a4e2a227842d0a1d3421d81a60a7d
Autor:
Eiras, Francisco, Petrov, Aleksandar, Vidgen, Bertie, de Witt, Christian Schroeder, Pizzati, Fabio, Elkins, Katherine, Mukhopadhyay, Supratik, Bibi, Adel, Csaba, Botos, Steibel, Fabro, Barez, Fazl, Smith, Genevieve, Guadagni, Gianluca, Chun, Jon, Cabot, Jordi, Imperial, Joseph Marvin, Nolazco-Flores, Juan A., Landay, Lori, Jackson, Matthew, Röttger, Paul, Torr, Philip H. S., Darrell, Trevor, Lee, Yong Suk, Foerster, Jakob
In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks
Externí odkaz:
http://arxiv.org/abs/2404.17047
In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens. By prompting an LLM like GPT-4 to explain these model internals, we can elucidate atte
Externí odkaz:
http://arxiv.org/abs/2402.15055
Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and
Externí odkaz:
http://arxiv.org/abs/2402.02619