Abstract: |
Ensuring that artificial intelligence behaves in a way that is aligned with human values is commonly referred to as the alignment challenge. Prior work has shown that rational agents, which act to maximize a utility function, will inevitably behave in ways that are not aligned with human values, especially as their level of intelligence increases. Prior work has also shown that there is no "one true utility function"; solutions must take a more holistic approach to alignment. This paper describes apprehensive agents: agents architected so that their effective utility function is an aggregation of a partial utility function (built by designers, to be maximized) and an expectation of negative feedback on given states (reasoned about, to be minimized). These agents are also capable of a temporal reasoning process that approximates designers' intentions as a function of how the environment evolves (a necessary feature for severe misalignment to occur). We show that an apprehensive agent, behaving rationally, leverages this internal approximation of designers' intentions to predict negative feedback and, as a consequence, behaves in a way that maximizes alignment without actually receiving any external feedback. We evaluate this strategy in simulated environments that expose misalignment opportunities: we show that apprehensive agents are indeed better aligned than their base counterparts and that, in contrast with extant techniques, the chances of alignment actually improve as agent intelligence grows. [ABSTRACT FROM AUTHOR] |
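
The effective utility described in the abstract can be illustrated with a minimal sketch. The abstract does not state how the two terms are aggregated, so the weighted difference used here is an assumption, as are the names `effective_utility`, `choose_state`, and `weight` and the toy numbers; none of them come from the paper.

```python
from typing import Callable, Hashable, Iterable

State = Hashable


def effective_utility(
    state: State,
    partial_utility: Callable[[State], float],
    expected_negative_feedback: Callable[[State], float],
    weight: float = 1.0,
) -> float:
    """Combine the two terms named in the abstract.

    ASSUMPTION: the aggregation is a weighted difference. The abstract only
    says the partial utility is maximized and the expected negative feedback
    is minimized; the exact form is not given.
    """
    return partial_utility(state) - weight * expected_negative_feedback(state)


def choose_state(
    candidates: Iterable[State],
    partial_utility: Callable[[State], float],
    expected_negative_feedback: Callable[[State], float],
    weight: float = 1.0,
) -> State:
    """Pick the candidate state with the highest effective utility."""
    return max(
        candidates,
        key=lambda s: effective_utility(
            s, partial_utility, expected_negative_feedback, weight
        ),
    )


# Hypothetical toy environment: a "shortcut" state scores higher on the
# designer-built partial utility but is predicted to draw negative feedback.
PARTIAL = {"shortcut": 10.0, "intended": 8.0}
FEEDBACK = {"shortcut": 0.9, "intended": 0.0}

# weight=0 ignores predicted feedback, mimicking a base utility maximizer.
base_choice = choose_state(PARTIAL, PARTIAL.get, FEEDBACK.get, weight=0.0)
# weight=5 lets predicted negative feedback outweigh the small utility gain.
apprehensive_choice = choose_state(PARTIAL, PARTIAL.get, FEEDBACK.get, weight=5.0)
```

In this toy setting the base agent selects the shortcut, while the apprehensive agent selects the intended state, because the predicted negative feedback is applied internally, with no external feedback signal involved. How the feedback expectation is actually computed (the temporal reasoning over designers' intentions) is outside the scope of this sketch.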