PKU-SafeRLHF: A Safety Alignment Preference Dataset for Llama Family Models

Author: Ji, Jiaming; Hong, Donghai; Zhang, Borong; Chen, Boyuan; Dai, Josef; Zheng, Boren; Qiu, Tianyi; Li, Boxun; Yang, Yaodong
Publication Year: 2024
Subject:
Document Type: Working Paper
Description: In this work, we introduce the PKU-SafeRLHF dataset, designed to promote research on safety alignment in large language models (LLMs). As a sibling project to SafeRLHF and BeaverTails, we separate the annotations of helpfulness and harmlessness for question-answering pairs, providing distinct perspectives on these coupled attributes. Overall, we provide 44.6k refined prompts and 265k question-answer pairs with safety meta-labels for 19 harm categories and three severity levels ranging from minor to severe, with answers generated by Llama-family models. Based on this, we collected 166.8k preference annotations, comprising dual-preference data (helpfulness and harmlessness decoupled) and single-preference data (helpfulness and harmlessness traded off in a single judgment). Using this large-scale annotation data, we further train a severity-sensitive moderation model for the risk control of LLMs and safety-centric RLHF algorithms for the safety alignment of LLMs. We believe this dataset will be a valuable resource for the community, aiding in the safe deployment of LLMs.
Comment: a sibling project to SafeRLHF and BeaverTails
Database: arXiv
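
To make the decoupled annotation scheme in the description concrete, below is a minimal Python sketch of loading and inspecting one preference record with the Hugging Face datasets library. It assumes the dataset is published on the Hugging Face Hub under the id "PKU-Alignment/PKU-SafeRLHF" and that the field names (prompt, response_0, response_1, better_response_id, safer_response_id) mirror the dual-preference structure described in the abstract; both the dataset id and the field names are assumptions to verify against the official release, not a confirmed schema.

# A minimal sketch, assuming the dataset id and field names below;
# verify both against the official PKU-SafeRLHF release.
from datasets import load_dataset

# Load the training split (dataset id is an assumption).
ds = load_dataset("PKU-Alignment/PKU-SafeRLHF", split="train")

example = ds[0]

# Dual-preference records decouple helpfulness from harmlessness:
# field names here are hypothetical, based on the abstract's description.
print(example["prompt"])              # one of the 44.6k refined prompts
print(example["response_0"])          # answer A, generated by a Llama-family model
print(example["response_1"])          # answer B, generated by a Llama-family model
print(example["better_response_id"])  # helpfulness preference (0 or 1)
print(example["safer_response_id"])   # harmlessness preference (0 or 1)

Keeping the two preference labels as separate fields is what allows training reward and cost models independently, rather than forcing annotators to collapse helpfulness and harmlessness into one score as in the single-preference data.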