Showing 1 - 8 of 8 for search: '"Lieberum, Tom"'
Author:
Lieberum, Tom, Rajamanoharan, Senthooran, Conmy, Arthur, Smith, Lewis, Sonnerat, Nicolas, Varma, Vikrant, Kramár, János, Dragan, Anca, Shah, Rohin, Nanda, Neel
Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside…
External link:
http://arxiv.org/abs/2408.05147
Author:
Rajamanoharan, Senthooran, Lieberum, Tom, Sonnerat, Nicolas, Conmy, Arthur, Varma, Vikrant, Kramár, János, Nanda, Neel
Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations faithfully…
External link:
http://arxiv.org/abs/2407.14435
Author:
Rajamanoharan, Senthooran, Conmy, Arthur, Smith, Lewis, Lieberum, Tom, Varma, Vikrant, Kramár, János, Shah, Rohin, Nanda, Neel
Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the…
External link:
http://arxiv.org/abs/2404.16014
Author:
Phuong, Mary, Aitchison, Matthew, Catt, Elliot, Cogan, Sarah, Kaskasoli, Alexandre, Krakovna, Victoria, Lindner, David, Rahtz, Matthew, Assael, Yannis, Hodkinson, Sarah, Howard, Heidi, Lieberum, Tom, Kumar, Ramana, Raad, Maria Abi, Webson, Albert, Ho, Lewis, Lin, Sharon, Farquhar, Sebastian, Hutter, Marcus, Deletang, Gregoire, Ruoss, Anian, El-Sayed, Seliem, Brown, Sasha, Dragan, Anca, Shah, Rohin, Dafoe, Allan, Shevlane, Toby
To understand the risks posed by a new AI system, we must understand what it can and cannot do. Building on prior work, we introduce a programme of new "dangerous capability" evaluations and pilot them on Gemini 1.0 models. Our evaluations cover four areas…
External link:
http://arxiv.org/abs/2403.13793
Activation Patching is a method of directly computing causal attributions of behavior to model components. However, applying it exhaustively requires a sweep with cost scaling linearly in the number of model components, which can be prohibitively expensive…
External link:
http://arxiv.org/abs/2403.00745
Author:
Lieberum, Tom, Rahtz, Matthew, Kramár, János, Nanda, Neel, Irving, Geoffrey, Shah, Rohin, Mikulik, Vladimir
Circuit analysis is a promising technique for understanding the internal mechanisms of language models. However, existing analyses are done in small models far from the state of the art. To address this, we present a case study of circuit analysis…
External link:
http://arxiv.org/abs/2307.09458
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the number of parameters, training data, or training steps. One approach to understanding emergence is to find continuous progress measures…
External link:
http://arxiv.org/abs/2301.05217
Author:
Shah, Rohin, Wang, Steven H., Wild, Cody, Milani, Stephanie, Kanervisto, Anssi, Goecks, Vinicius G., Waytowich, Nicholas, Watkins-Valls, David, Prakash, Bharat, Mills, Edmund, Garg, Divyansh, Fries, Alexander, Souly, Alexandra, Shern, Chan Jun, del Castillo, Daniel, Lieberum, Tom
We held the first-ever MineRL Benchmark for Agents that Solve Almost-Lifelike Tasks (MineRL BASALT) Competition at the Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS 2021). The goal of the competition was to promote research…
External link:
http://arxiv.org/abs/2204.07123