Showing 1 - 10 of 10 for search: '"Turner, Alexander Matt"'
Neural networks are trained primarily based on their inputs and outputs, without regard for their internal mechanisms. These neglected mechanisms determine properties that are critical for safety, like (i) transparency; (ii) the absence of sensitive ...
External link:
http://arxiv.org/abs/2410.04332
Author:
Panickssery, Nina, Gabrieli, Nick, Schulz, Julian, Tong, Meg, Hubinger, Evan, Turner, Alexander Matt
We introduce Contrastive Activation Addition (CAA), an innovative method for steering language models by modifying their activations during forward passes. CAA computes "steering vectors" by averaging the difference in residual stream activations between ...
External link:
http://arxiv.org/abs/2312.06681
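The CAA entry above (arXiv:2312.06681) describes computing steering vectors by averaging residual-stream activation differences over contrasting prompt pairs and adding the result during forward passes. The sketch below is a minimal, hedged illustration of that recipe, not the authors' code: the model name, layer index, steering coefficient, and contrastive prompts are all illustrative assumptions, and the forward-hook injection is just one way to add a vector into the residual stream.

```python
# Hedged sketch of contrastive activation steering (not the authors' implementation).
# Assumes a HuggingFace GPT-2-style causal LM; MODEL, LAYER, and COEF are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "gpt2"   # assumption: any decoder-only LM with an accessible residual stream
LAYER = 6        # assumption: which block's output to read and steer
COEF = 4.0       # assumption: steering strength

tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def resid_at_layer(prompt: str) -> torch.Tensor:
    """Residual-stream activation after block LAYER at the prompt's last token."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block LAYER's output is index LAYER + 1.
    return out.hidden_states[LAYER + 1][0, -1, :]

# 1) Steering vector: mean activation difference over contrastive prompt pairs.
positive = ["I love helping people.", "Being kind matters to me."]
negative = ["I enjoy hurting people.", "Being cruel matters to me."]
steer = torch.stack(
    [resid_at_layer(p) - resid_at_layer(n) for p, n in zip(positive, negative)]
).mean(dim=0)

# 2) Inject the vector into block LAYER's output during generation via a forward hook.
def add_steering(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + COEF * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt = tok("My view of other people is that", return_tensors="pt")
print(tok.decode(model.generate(**prompt, max_new_tokens=30)[0]))
handle.remove()
```

In CAA-style steering, the sign and magnitude of the coefficient control the direction and strength of the effect, so negating COEF in the sketch would push generations away from the "positive" behavior instead of toward it.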
Author:
Mini, Ulisse, Grietzer, Peli, Sharma, Mrinank, Meek, Austin, MacDiarmid, Monte, Turner, Alexander Matt
To understand the goals and goal representations of AI systems, we carefully study a pretrained reinforcement learning policy that solves mazes by navigating to a range of target squares. We find this network pursues multiple context-dependent goals, ...
External link:
http://arxiv.org/abs/2310.08043
Author:
Turner, Alexander Matt, Thiergart, Lisa, Leech, Gavin, Udell, David, Vazquez, Juan J., Mini, Ulisse, MacDiarmid, Monte
Prompt engineering and finetuning aim to maximize language model performance on a given metric (like toxicity reduction). However, these methods do not fully elicit a model's capabilities. To reduce this gap, we introduce activation engineering: the ...
External link:
http://arxiv.org/abs/2308.10248
If capable AI agents are generally incentivized to seek power in service of the objectives we specify for them, then these systems will pose enormous risks, in addition to enormous benefits. In fully observable environments, most reward functions have ...
External link:
http://arxiv.org/abs/2206.13477
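The entry above (arXiv:2206.13477) argues that, in fully observable environments, most reward functions incentivize keeping options open. The toy simulation below is only an illustrative sketch of that "most reward functions" intuition; the four-terminal-state environment and uniform i.i.d. reward sampling are assumptions for illustration and do not come from the paper.

```python
# Hedged toy illustration (not the paper's construction): in a tiny fully observable
# setting, most randomly drawn reward functions favor the action that keeps options open.
import numpy as np

rng = np.random.default_rng(0)
trials = 100_000
keeps_options = 0

for _ in range(trials):
    r = rng.uniform(size=4)           # i.i.d. rewards over four terminal states (assumption)
    value_commit = r[0]               # action A: commit immediately to terminal state 0
    value_keep_options = r[1:].max()  # action B: keep three terminal states reachable
    keeps_options += value_keep_options > value_commit

print(f"fraction of reward draws preferring the option-preserving action: "
      f"{keeps_options / trials:.3f}")  # roughly 0.75 under these assumptions
```

With i.i.d. continuous rewards, the option-preserving action wins for about three quarters of reward draws, matching the intuition that having more reachable outcomes raises the chance of containing the optimum.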
Author:
Turner, Alexander Matt
We do not know how to align a very intelligent AI agent's behavior with human interests. I investigate whether -- absent a full solution to this AI alignment problem -- we can build smart AI agents which have limited impact on the world, and which do ...
External link:
http://arxiv.org/abs/2206.11831
AI objectives are often hard to specify properly. Some approaches tackle this problem by regularizing the AI's side effects: Agents must weigh off "how much of a mess they make" with an imperfectly specified proxy objective. We propose a formal criterion ...
External link:
http://arxiv.org/abs/2206.11812
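The entry above (arXiv:2206.11812) frames side-effect regularization as weighing "how much of a mess" the agent makes against an imperfect proxy objective. Written out for concreteness, a common schema for this trade-off looks like the following; this is a generic template assumed for illustration, not necessarily the paper's formal criterion.

```latex
% Generic side-effect-regularization template (assumed schema, not the paper's criterion).
\[
  \pi^{\star} \in \arg\max_{\pi}\;
  \underbrace{\mathbb{E}_{\pi}\!\Bigl[\textstyle\sum_{t}\gamma^{t} R_{\mathrm{proxy}}(s_t, a_t)\Bigr]}_{\text{imperfect proxy objective}}
  \;-\;
  \lambda\,\underbrace{\mathrm{SideEffects}(\pi)}_{\text{``mess'' penalty}},
  \qquad \lambda \ge 0 .
\]
```

The question any formal criterion must then answer is what $\mathrm{SideEffects}(\pi)$ should measure; the AUP entries below give one family of answers.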
Reward function specification can be difficult. Rewarding the agent for making a widget may be easy, but penalizing the multitude of possible negative side effects is hard. In toy environments, Attainable Utility Preservation (AUP) avoided side effects ...
External link:
http://arxiv.org/abs/2006.06547
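The AUP entry above (arXiv:2006.06547) penalizes shifts in the agent's ability to achieve auxiliary goals. One common way the AUP penalty is written in this line of work is sketched below; this is a reconstruction for illustration, and the exact scaling and the construction of the auxiliary reward set vary across the AUP papers.

```latex
% AUP-style penalized reward (reconstructed schema; details vary across papers).
% \varnothing is a designated no-op action; \mathcal{R}_{\mathrm{aux}} is a set of auxiliary reward functions.
\[
  R_{\mathrm{AUP}}(s, a) \;:=\; R(s, a)
  \;-\; \frac{\lambda}{|\mathcal{R}_{\mathrm{aux}}|}
        \sum_{R_i \in \mathcal{R}_{\mathrm{aux}}}
        \bigl|\, Q^{*}_{R_i}(s, a) \;-\; Q^{*}_{R_i}(s, \varnothing) \,\bigr|
\]
```

Actions that sharply change the attainable value of the auxiliary goals, relative to doing nothing, are taxed; this instantiates the SideEffects term in the template above.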
Some researchers speculate that intelligent reinforcement learning (RL) agents would be incentivized to seek resources and power in pursuit of their objectives. Other researchers point out that RL agents need not have human-like power-seeking instincts ...
External link:
http://arxiv.org/abs/1912.01683
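The entry above (arXiv:1912.01683) analyzes when optimal policies tend to seek power. One formalization of "power" used in this research line is, roughly, a state's average optimal value under a distribution over reward functions; the version below is reconstructed from memory, and the normalization may differ from the paper's.

```latex
% POWER as average attainable value (reconstruction; normalization may differ from the paper).
\[
  \mathrm{POWER}_{\mathcal{D}}(s, \gamma)
  \;:=\; \frac{1-\gamma}{\gamma}\;
  \mathbb{E}_{R \sim \mathcal{D}}\!\bigl[\, V^{*}_{R}(s, \gamma) - R(s) \,\bigr]
\]
```

Under this reading, states from which many high-value futures remain reachable have higher power, which connects the formal results to the informal "keeping options open" intuition illustrated in the toy simulation above.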
Reward functions are easy to misspecify; although designers can make corrections after observing mistakes, an agent pursuing a misspecified reward function can irreversibly change the state of its environment. If that change precludes optimization of ...
External link:
http://arxiv.org/abs/1902.09725