Showing 1 - 2 of 2 for search: '"Farnik, Lucy"'
Sparse autoencoders (SAEs) are a promising approach to interpreting the internal representations of transformer language models. However, SAEs are usually trained separately on each transformer layer, making it difficult to use them to study how info
External link:
http://arxiv.org/abs/2409.04185
Authors:
Skalse, Joar; Farnik, Lucy; Motwani, Sumeet Ramesh; Jenner, Erik; Gleave, Adam; Abate, Alessandro
In order to solve a task using reinforcement learning, it is necessary to first formalise the goal of that task as a reward function. However, for many real-world tasks, it is very difficult to manually specify a reward function that never incentivis
External link:
http://arxiv.org/abs/2309.15257