FlipGuard: Defending Preference Alignment against Update Regression with Constrained Optimization

Authors: Zhu, Mingye; Liu, Yi; Wang, Quan; Guo, Junbo; Mao, Zhendong
Publication year: 2024
Subject:
Document type: Working Paper
Description: Recent breakthroughs in preference alignment have significantly improved Large Language Models' ability to generate texts that align with human preferences and values. However, current alignment metrics typically emphasize post-hoc overall improvement while overlooking a critical aspect: regression, i.e., backsliding on previously correctly-handled data after updates. This pitfall may arise from excessive fine-tuning on already well-aligned data, which subsequently leads to over-alignment and degeneration. To address this challenge, we propose FlipGuard, a constrained optimization approach to detect and mitigate update regression with focal attention. Specifically, FlipGuard identifies performance degradation using a customized reward characterization and strategically enforces a constraint to encourage conditional congruence with the pre-aligned model during training. Comprehensive experiments demonstrate that FlipGuard effectively alleviates update regression while maintaining excellent overall performance, with the added benefit of preserving knowledge during preference alignment.
Comment: Accepted by EMNLP 2024 Main track
Database: arXiv
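The conditional-congruence idea in the description can be sketched as a per-example constrained loss: examples whose reward regressed after the update (a "flip") receive an extra penalty pulling the updated policy back toward the pre-aligned model. This is a minimal illustrative sketch, not the paper's exact formulation; the function name, the squared log-probability penalty, and the stand-in alignment objective are all assumptions for illustration.

```python
def flipguard_loss(policy_logps, ref_logps, rewards_before, rewards_after, beta=0.1):
    """Illustrative sketch of a flip-aware constrained objective.

    policy_logps:   log-probs of responses under the updated policy
    ref_logps:      log-probs under the pre-aligned (reference) model
    rewards_before: reward scores before the update
    rewards_after:  reward scores after the update
    beta:           strength of the congruence penalty (hypothetical)
    """
    total = 0.0
    for lp, ref_lp, r0, r1 in zip(policy_logps, ref_logps, rewards_before, rewards_after):
        loss = -r1 * lp  # stand-in reward-weighted alignment objective
        if r1 < r0:  # "flip": this example's reward regressed after the update
            # conditional congruence: penalize divergence from the pre-aligned model
            loss += beta * (lp - ref_lp) ** 2
        total += loss
    return total / len(policy_logps)
```

In this toy form, only regressed examples pay the congruence penalty, so well-aligned data is left free to improve, which mirrors the "focal attention" described above.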