Search results

Report

PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning

Authors: Fu, Tingchen; Sharma, Mrinank; Torr, Philip; Cohen, Shay B.; Krueger, David; Barez, Fazl

Preference learning is a central component for aligning current LLMs, but this process can be vulnerable to data poisoning attacks. To address this concern, we introduce PoisonBench, a benchmark for evaluating large language models' susceptibility to

External link: http://arxiv.org/abs/2410.08811

Report

Towards Interpreting Visual Information Processing in Vision-Language Models

Authors: Neo, Clement; Ong, Luke; Torr, Philip; Geva, Mor; Krueger, David; Barez, Fazl

Vision-Language Models (VLMs) are powerful tools for processing and understanding text and images. We study the processing of visual tokens in the language model component of LLaVA, a prominent VLM. Our approach focuses on analyzing the localization

External link: http://arxiv.org/abs/2410.07149

Report

Sparse Autoencoders Reveal Universal Feature Spaces Across Large Language Models

Authors: Lan, Michael; Torr, Philip; Meek, Austin; Khakzar, Ashkan; Krueger, David; Barez, Fazl

We investigate feature universality in large language models (LLMs), a research field that aims to understand how different models similarly represent concepts in the latent spaces of their intermediate layers. Demonstrating feature universality allo

External link: http://arxiv.org/abs/2410.06981

Report

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Authors: Denison, Carson; MacDiarmid, Monte; Barez, Fazl; Duvenaud, David; Kravec, Shauna; Marks, Samuel; Schiefer, Nicholas; Soklaski, Ryan; Tamkin, Alex; Kaplan, Jared; Shlegeris, Buck; Bowman, Samuel R.; Perez, Ethan; Hubinger, Evan

In reinforcement learning, specification gaming occurs when AI systems learn undesired behaviors that are highly rewarded due to misspecified training goals. Specification gaming can range from simple behaviors like sycophancy to sophisticated and pe

External link: http://arxiv.org/abs/2406.10162

Report

Risks and Opportunities of Open-Source Generative AI

Applications of Generative AI (Gen AI) are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about the potential risks of the tec

External link: http://arxiv.org/abs/2405.08597

Report

Visualizing Neural Network Imagination

Authors: Wichers, Nevan; Tao, Victor; Volpato, Riccardo; Barez, Fazl

In certain situations, neural networks will represent environment states in their hidden activations. Our goal is to visualize what environment states the networks are representing. We experiment with a recurrent neural network (RNN) architecture wit

External link: http://arxiv.org/abs/2405.06409

Report

Near to Mid-term Risks and Opportunities of Open-Source Generative AI

In the next few years, applications of Generative AI are expected to revolutionize a number of different areas, ranging from science & medicine to education. The potential for these seismic changes has triggered a lively debate about potential risks

External link: http://arxiv.org/abs/2404.17047

Report

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

Authors: Neo, Clement; Cohen, Shay B.; Barez, Fazl

In this paper, we investigate the interplay between attention heads and specialized "next-token" neurons in the Multilayer Perceptron that predict specific tokens. By prompting an LLM like GPT-4 to explain these model internals, we can elucidate atte

External link: http://arxiv.org/abs/2402.15055

Report

Increasing Trust in Language Models through the Reuse of Verified Circuits

Authors: Quirke, Philip; Neo, Clement; Barez, Fazl

Language Models (LMs) are increasingly used for a wide range of prediction tasks, but their training can often neglect rare edge cases, reducing their reliability. Here, we define a stringent standard of trustworthiness whereby the task algorithm and

External link: http://arxiv.org/abs/2402.02619
