Search results

Report

Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs

Authors: Price, Sara; Panickssery, Arjun; Bowman, Sam; Stickland, Asa Cooper

Backdoors are hidden behaviors that are only triggered once an AI system has been deployed. Bad actors looking to create successful backdoors must design them to avoid activation during training and evaluation. Since data used in these stages often o

External link: http://arxiv.org/abs/2407.04108

Report

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

We describe our early efforts to red team language models in order to simultaneously discover, measure, and attempt to reduce their potentially harmful outputs. We make three main contributions. First, we investigate scaling behaviors for red teaming

External link: http://arxiv.org/abs/2209.07858

Report

Language Models (Mostly) Know What They Know

We study whether language models can evaluate the validity of their own claims and predict which questions they will be able to answer correctly. We first show that larger models are well-calibrated on diverse multiple choice and true/false questions

External link: http://arxiv.org/abs/2207.05221

Report

Detecting and Explaining Crisis

Authors: Kshirsagar, Rohan; Morris, Robert; Bowman, Sam

Individuals on social media may reveal themselves to be in various states of crisis (e.g. suicide, self-harm, abuse, or eating disorders). Detecting crisis from social media text automatically and accurately can have profound consequences. However, d

External link: http://arxiv.org/abs/1705.09585
