Search results - "Varma, Vikrant"

Report

Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

Authors: Lieberum, Tom; Rajamanoharan, Senthooran; Conmy, Arthur; Smith, Lewis; Sonnerat, Nicolas; Varma, Vikrant; Kramár, János; Dragan, Anca; Shah, Rohin; Nanda, Neel

Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network's latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outsi

External link: http://arxiv.org/abs/2408.05147

Report

Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

Authors: Rajamanoharan, Senthooran; Lieberum, Tom; Sonnerat, Nicolas; Conmy, Arthur; Varma, Vikrant; Kramár, János; Nanda, Neel

Sparse autoencoders (SAEs) are a promising unsupervised approach for identifying causally relevant and interpretable linear features in a language model's (LM) activations. To be useful for downstream tasks, SAEs need to decompose LM activations fait

External link: http://arxiv.org/abs/2407.14435

Report

Improving Dictionary Learning with Gated Sparse Autoencoders

Authors: Rajamanoharan, Senthooran; Conmy, Arthur; Smith, Lewis; Lieberum, Tom; Varma, Vikrant; Kramár, János; Shah, Rohin; Nanda, Neel

Recent work has found that sparse autoencoders (SAEs) are an effective technique for unsupervised discovery of interpretable features in language models' (LMs) activations, by finding sparse, linear reconstructions of LM activations. We introduce the

External link: http://arxiv.org/abs/2404.16014

Report

Challenges with unsupervised LLM knowledge discovery

Authors: Farquhar, Sebastian; Varma, Vikrant; Kenton, Zachary; Gasteiger, Johannes; Mikulik, Vladimir; Shah, Rohin

We show that existing unsupervised methods on large language model (LLM) activations do not discover knowledge -- instead they seem to discover whatever feature of the activations is most prominent. The idea behind unsupervised knowledge elicitation

External link: http://arxiv.org/abs/2312.10029

Report

Explaining grokking through circuit efficiency

Authors: Varma, Vikrant; Shah, Rohin; Kenton, Zachary; Kramár, János; Kumar, Ramana

One of the most surprising puzzles in neural network generalisation is grokking: a network with perfect training accuracy but poor generalisation will, upon further training, transition to perfect generalisation. We propose that grokking occurs when

External link: http://arxiv.org/abs/2309.02390

Report

Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals

Authors: Shah, Rohin; Varma, Vikrant; Kumar, Ramana; Phuong, Mary; Krakovna, Victoria; Uesato, Jonathan; Kenton, Zac

The field of AI alignment is concerned with AI systems that pursue unintended goals. One commonly studied mechanism by which an unintended goal might arise is specification gaming, in which the designer-provided specification is flawed in a way that

External link: http://arxiv.org/abs/2210.01790

Report

Safe Deep RL in 3D Environments using Human Feedback

Authors: Rahtz, Matthew; Varma, Vikrant; Kumar, Ramana; Kenton, Zachary; Legg, Shane; Leike, Jan

Agents should avoid unsafe behaviour during both training and deployment. This typically requires a simulator and a procedural specification of unsafe behaviour. Unfortunately, a simulator is not always available, and procedurally specifying constrai

External link: http://arxiv.org/abs/2201.08102

Report

Imitating Interactive Intelligence

A common vision from science fiction is that robots will one day inhabit our physical spaces, sense the world as we do, assist our physical labours, and communicate with us through natural language. Here we study how to design artificial agents that

External link: http://arxiv.org/abs/2012.05672
