Showing 1 - 10 of 457 for search: '"Krueger, David A"'
Machine unlearning is gaining increasing attention as a way to remove adversarial data poisoning attacks from already trained models and to comply with privacy and AI regulations. The objective is to unlearn the effect of undesired data from a traine
External link:
http://arxiv.org/abs/2412.00761
A key objective of interpretability research on large language models (LLMs) is to develop methods for robustly steering models toward desired behaviors. To this end, two distinct approaches to interpretability -- "bottom-up" and "top-down" -- have
External link:
http://arxiv.org/abs/2411.07213
Transformers have demonstrated remarkable in-context learning capabilities across various domains, including statistical learning tasks. While previous work has shown that transformers can implement common learning algorithms, the adversarial robustn
External link:
http://arxiv.org/abs/2411.05189
Zero-shot coordination (ZSC) is a popular setting for studying the ability of reinforcement learning (RL) agents to coordinate with novel partners. Prior ZSC formulations assume the problem setting is common knowledge: each agent knows the
External link:
http://arxiv.org/abs/2411.04976
Sparse Autoencoders (SAEs) have shown promise in improving the interpretability of neural network activations, but can learn features that are not features of the input, limiting their effectiveness. We propose Mutual Feature Regularization
External link:
http://arxiv.org/abs/2411.01220
As reinforcement learning agents become increasingly deployed in real-world scenarios, predicting future agent actions and events during deployment is important for facilitating better human-agent interaction and preventing catastrophic outcomes. Thi
External link:
http://arxiv.org/abs/2410.22459
Deep neural networks have proven to be extremely powerful; however, they are also vulnerable to adversarial attacks, which can cause hazardous incorrect predictions in safety-critical applications. Certified robustness via randomized smoothing gives a
External link:
http://arxiv.org/abs/2410.20432
Representation engineering methods have recently shown promise for enabling efficient steering of model behavior. However, evaluation pipelines for these methods have primarily relied on subjective demonstrations, instead of quantitative, objective m
External link:
http://arxiv.org/abs/2410.17245
The real-time dynamics of local magnetic moments exchange coupled to a metallic system of conduction electrons is subject to dissipative friction even in the absence of spin-orbit coupling. Phenomenologically, this is usually described by a local Gil
External link:
http://arxiv.org/abs/2410.16003
Author:
Mlodozeniec, Bruno; Eschenhagen, Runa; Bae, Juhan; Immer, Alexander; Krueger, David; Turner, Richard
Diffusion models have led to significant advancements in generative modelling. Yet their widespread adoption poses challenges regarding data attribution and interpretability. In this paper, we aim to help address such challenges in diffusion models b
External link:
http://arxiv.org/abs/2410.13850