Down the Toxicity Rabbit Hole: A Novel Framework to Bias Audit Large Language Models

Author:	Dutta, Arka; Khorramrouz, Adel; Dutta, Sujan; KhudaBukhsh, Ashiqur R.
Year of publication:	2023
Subject:	Computer Science - Computation and Language; Computer Science - Computers and Society
Document type:	Working Paper
Description:	This paper makes three contributions. First, it presents a generalizable, novel framework dubbed "toxicity rabbit hole" that iteratively elicits toxic content from a wide suite of large language models. Spanning a set of 1,266 identity groups, we first conduct a bias audit of PaLM 2 guardrails, presenting key insights. Next, we report generalizability across several other models. Through the elicited toxic content, we present a broad analysis with a key emphasis on racism, antisemitism, misogyny, Islamophobia, homophobia, and transphobia. Finally, driven by concrete examples, we discuss potential ramifications.
Database:	arXiv
External link:	http://arxiv.org/abs/2309.06415