Showing 1 - 8 of 8 for search: '"Lieberum, Tom"'
Author:
Lieberum, Tom, Rajamanoharan, Senthooran, Conmy, Arthur, Smith, Lewis, Sonnerat, Nicolas, Varma, Vikrant, Kramár, János, Dragan, Anca, Shah, Rohin, Nanda, Neel
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside…
External link:
http://arxiv.org/abs/2408.05147
Author:
Rajamanoharan, Senthooran, Lieberum, Tom, Sonnerat, Nicolas, Conmy, Arthur, Varma, Vikrant, Kramár, János, Nanda, Neel
Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully…
External link:
http://arxiv.org/abs/2407.14435
Author:
Rajamanoharan, Senthooran, Conmy, Arthur, Smith, Lewis, Lieberum, Tom, Varma, Vikrant, Kramár, János, Shah, Rohin, Nanda, Neel
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the…
External link:
http://arxiv.org/abs/2404.16014
Author:
Phuong, Mary, Aitchison, Matthew, Catt, Elliot, Cogan, Sarah, Kaskasoli, Alexandre, Krakovna, Victoria, Lindner, David, Rahtz, Matthew, Assael, Yannis, Hodkinson, Sarah, Howard, Heidi, Lieberum, Tom, Kumar, Ramana, Raad, Maria Abi, Webson, Albert, Ho, Lewis, Lin, Sharon, Farquhar, Sebastian, Hutter, Marcus, Deletang, Gregoire, Ruoss, Anian, El-Sayed, Seliem, Brown, Sasha, Dragan, Anca, Shah, Rohin, Dafoe, Allan, Shevlane, Toby
To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas…
External link:
http://arxiv.org/abs/2403.13793
Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive…
External link:
http://arxiv.org/abs/2403.00745
Author:
Lieberum, Tom, Rahtz, Matthew, Kramár, János, Nanda, Neel, Irving, Geoffrey, Shah, Rohin, Mikulik, Vladimir
Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis…
External link:
http://arxiv.org/abs/2307.09458
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the number of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures…
External link:
http://arxiv.org/abs/2301.05217
Author:
Shah, Rohin, Wang, Steven H., Wild, Cody, Milani, Stephanie, Kanervisto, Anssi, Goecks, Vinicius G., Waytowich, Nicholas, Watkins-Valls, David, Prakash, Bharat, Mills, Edmund, Garg, Divyansh, Fries, Alexander, Souly, Alexandra, Shern, Chan Jun, del Castillo, Daniel, Lieberum, Tom
We held the first-ever MineRL Benchmark for Agents that Solve Almost-Lifelike Tasks (MineRL BASALT) Competition at the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021). The goal of the competition was to promote research…
External link:
http://arxiv.org/abs/2204.07123