Showing 1 - 10 of 1,519 results for search: '"CHAN, Lawrence"'
Author:
Wijk, Hjalmar, Lin, Tao, Becker, Joel, Jawhar, Sami, Parikh, Neev, Broadley, Thomas, Chan, Lawrence, Chen, Michael, Clymer, Josh, Dhyani, Jai, Ericheva, Elena, Garcia, Katharyn, Goodrich, Brian, Jurkovic, Nikola, Kinniment, Megan, Lajko, Aron, Nix, Seraphina, Sato, Lucas, Saunders, William, Taran, Maksym, West, Ben, Barnes, Elizabeth
Frontier AI safety policies highlight automation of AI research and development (R&D) by AI agents as an important capability to anticipate. However, there exist few evaluations for AI R&D capabilities, and none that are highly realistic and have a…
External link:
http://arxiv.org/abs/2411.15114
Superposition -- when a neural network represents more ``features'' than it has dimensions -- seems to pose a serious challenge to mechanistically interpreting current AI systems. Existing theory work studies \emph{representational} superposition, where…
External link:
http://arxiv.org/abs/2408.05451
Author:
Gross, Jason, Agrawal, Rajashree, Kwa, Thomas, Ong, Euan, Yip, Chun Hei, Gibson, Alex, Noubir, Soufiane, Chan, Lawrence
We propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving…
External link:
http://arxiv.org/abs/2406.11779
Author:
Kinniment, Megan, Sato, Lucas Jun Koba, Du, Haoxing, Goodrich, Brian, Hasin, Max, Chan, Lawrence, Miles, Luke Harold, Lin, Tao R., Wijk, Hjalmar, Burget, Joel, Ho, Aaron, Barnes, Elizabeth, Christiano, Paul
In this report, we explore the ability of language model agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild. We refer to this cluster of capabilities as "autonomous replication and adaptation"…
External link:
http://arxiv.org/abs/2312.11671
Universality is a key hypothesis in mechanistic interpretability -- that different models learn similar features and circuits when trained on similar tasks. In this work, we study the universality hypothesis by examining how small neural networks learn…
External link:
http://arxiv.org/abs/2302.03025
Neural networks often exhibit emergent behavior, where qualitatively new capabilities arise from scaling up the amount of parameters, training data, or training steps. One approach to understanding emergence is to find continuous \textit{progress measures}…
External link:
http://arxiv.org/abs/2301.05217