Showing 1 - 1 of 1 for search: '"Cargnelutti, Matteo"'
Reinforcement Learning from Human Feedback (RLHF) aims to align language models (LMs) with human values by training reward models (RMs) on binary preferences and using these RMs to fine-tune the base LMs. Despite its importance, the internal mechanisms…
External link:
http://arxiv.org/abs/2408.10270
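
The abstract snippet above describes the standard RLHF recipe: a reward model is fit to binary human preferences and then used to fine-tune the base LM. As a rough illustration only (not drawn from the linked paper), the sketch below shows the usual Bradley-Terry pairwise objective for such a reward model; the function name, tensor names, and toy values are hypothetical.

import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: push the reward of the preferred response
    # above the reward of the rejected response for each labeled preference pair.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage with hypothetical scalar rewards produced by a reward-model head.
r_chosen = torch.tensor([1.2, 0.4, 0.9])
r_rejected = torch.tensor([0.3, 0.8, -0.1])
print(reward_model_loss(r_chosen, r_rejected).item())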