Showing 1 - 9 of 9 for search: '"Rager, Can"'
Author:
Mueller, Aaron, Brinkmann, Jannik, Li, Millicent, Marks, Samuel, Pal, Koyena, Prakash, Nikhil, Rager, Can, Sankaranarayanan, Aruna, Sharma, Arnab Sen, Sun, Jiuding, Todd, Eric, Bau, David, Belinkov, Yonatan
Interpretability provides a toolset for understanding how and why neural networks behave in certain ways. However, there is little unity in the field: most studies employ ad-hoc evaluations and do not share theoretical foundations, making it difficult…
External link:
http://arxiv.org/abs/2408.01416
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models
Author:
Karvonen, Adam, Wright, Benjamin, Rager, Can, Angell, Rico, Brinkmann, Jannik, Smith, Logan, Verdun, Claudio Mayrink, Bau, David, Marks, Samuel
What latent features are encoded in language model (LM) representations? Recent work on training sparse autoencoders (SAEs) to disentangle interpretable features in LM representations has shown significant promise. However, evaluating the quality of… (An illustrative SAE sketch follows this entry.)
External link:
http://arxiv.org/abs/2408.00113
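
As a reading aid only (not code from the paper above): a minimal sparse autoencoder of the kind this line of work trains on LM activations, written in PyTorch. The layer sizes and the L1 coefficient are illustrative assumptions.

import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Minimal SAE: overcomplete ReLU encoder plus linear decoder."""
    def __init__(self, d_model: int = 512, d_hidden: int = 4096):  # assumed sizes
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))   # non-negative feature activations
        x_hat = self.decoder(f)           # reconstruction of the input activation
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff: float = 1e-3):  # assumed sparsity coefficient
    # Reconstruction error plus an L1 penalty that encourages sparse features.
    return (x - x_hat).pow(2).mean() + l1_coeff * f.abs().mean()

sae = SparseAutoencoder()
acts = torch.randn(8, 512)   # stand-in for residual-stream activations
x_hat, f = sae(acts)
print(sae_loss(acts, x_hat, f).item())

The open question the entry points at is evaluation: reconstruction loss and sparsity alone do not certify that the learned features are interpretable.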
Author:
Fiotto-Kaufman, Jaden, Loftus, Alexander R, Todd, Eric, Brinkmann, Jannik, Juang, Caden, Pal, Koyena, Rager, Can, Mueller, Aaron, Marks, Samuel, Sharma, Arnab Sen, Lucchetti, Francesca, Ripa, Michael, Belfki, Adam, Prakash, Nikhil, Multani, Sumeet, Brodley, Carla, Guha, Arjun, Bell, Jonathan, Wallace, Byron, Bau, David
The enormous scale of state-of-the-art foundation models has limited their accessibility to scientists, because customized experiments at large model sizes require costly hardware and complex engineering that is impractical for most researchers. To a…
External link:
http://arxiv.org/abs/2407.14561
We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and…
External link:
http://arxiv.org/abs/2403.19647
Author:
Ivanitskiy, Michael Igorevich, Spies, Alex F., Räuker, Tilman, Corlouer, Guillaume, Mathwin, Chris, Quirke, Lucia, Rager, Can, Shah, Rusheb, Valentine, Dan, Behn, Cecilia Diniz, Inoue, Katsumi, Fung, Samy Wu
Transformer models underpin many recent advances in practical machine learning applications, yet understanding their internal behavior continues to elude researchers. Given the size and complexity of these models, forming a comprehensive picture of…
External link:
http://arxiv.org/abs/2312.02566
Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to… (An illustrative activation-patching sketch follows this entry.)
External link:
http://arxiv.org/abs/2310.10348
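
For orientation only: the generic activation-patching recipe that automated circuit discovery builds on, shown here on a toy two-layer PyTorch model rather than a language model. The model, inputs, and hook target are illustrative assumptions.

import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))
clean_x, corrupt_x = torch.randn(1, 4), torch.randn(1, 4)

# 1) Cache an intermediate activation from the corrupted run.
cache = {}
handle = model[1].register_forward_hook(lambda m, i, o: cache.update(act=o.detach()))
model(corrupt_x)
handle.remove()

# 2) Re-run the clean input, patching in the cached corrupted activation.
handle = model[1].register_forward_hook(lambda m, i, o: cache["act"])
patched_out = model(clean_x)
handle.remove()

# 3) The patch's causal effect is the change it induces in the output.
clean_out = model(clean_x)
print("patch effect:", (patched_out - clean_out).abs().sum().item())

Repeating this for every component and input position is what makes exhaustive patching expensive, which is the scaling problem automated approaches try to address.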
How do language models deal with the limited bandwidth of the residual stream? Prior work has suggested that some attention heads and MLP layers may perform a "memory management" role. That is, clearing residual stream directions set by earlier layers… (A rough illustration of measuring this follows this entry.)
External link:
http://arxiv.org/abs/2310.07325
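
As a rough illustration (not the paper's method): one way to quantify such memory management is to check whether a later component writes against a residual-stream direction set earlier. The vectors below are random stand-ins and d_model is an assumed size.

import torch

torch.manual_seed(0)
d_model = 64                              # assumed width of the residual stream
earlier_write = torch.randn(d_model)      # direction added by an earlier layer
later_write = torch.randn(d_model)        # output of a later head or MLP

# Cosine similarity near -1 would suggest the later component largely erases
# (writes in the opposite direction of) the earlier contribution.
cos = torch.nn.functional.cosine_similarity(earlier_write, later_write, dim=0)
# Signed projection: how much of the earlier direction the later write removes.
proj = torch.dot(later_write, earlier_write) / earlier_write.norm()
print(f"cosine similarity: {cos.item():.3f}, signed projection: {proj.item():.3f}")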
Author:
Ivanitskiy, Michael Igorevich, Shah, Rusheb, Spies, Alex F., Räuker, Tilman, Valentine, Dan, Rager, Can, Quirke, Lucia, Mathwin, Chris, Corlouer, Guillaume, Behn, Cecilia Diniz, Fung, Samy Wu
Understanding how machine learning models respond to distributional shifts is a key research challenge. Mazes serve as an excellent testbed due to varied generation algorithms offering a nuanced platform to simulate both subtle and pronounced distributional…
External link:
http://arxiv.org/abs/2309.10498
Author:
Rager, Can, Webster, Kyle
The scalability of modern computing hardware is limited by physical bottlenecks and high energy consumption. These limitations could be addressed by neuromorphic hardware (NMH), which is inspired by the human brain. NMH enables physically built-in…
External link:
http://arxiv.org/abs/2301.10201