Showing 1 - 10 of 178 results for search: '"STEINHARDT, JACOB"'
Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize, but struggle to quantify. These "vibes" -- such as tone, formatting, or writing style -- influence user preferences, …
External link:
http://arxiv.org/abs/2410.12851
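The entry above is about quantifying qualitative output differences. As a rough illustration only (not the paper's method), one way to turn a "vibe" into a number is to ask a judge model which of two outputs exhibits it more and count the votes; the `judge` callable below is a hypothetical stand-in for whatever LLM API is available.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

def vibe_preference_rate(
    vibe: str,
    pairs: Iterable[Tuple[str, str]],
    judge: Callable[[str], str],
) -> float:
    """Fraction of (model-A output, model-B output) pairs for which the judge
    says A's output exhibits `vibe` more strongly; a rate far from 0.5 means
    the vibe reliably separates the two models."""
    votes = Counter()
    for out_a, out_b in pairs:
        prompt = (
            f"Which response exhibits this trait more: {vibe}?\n\n"
            f"Response A:\n{out_a}\n\nResponse B:\n{out_b}\n\n"
            "Answer with exactly 'A' or 'B'."
        )
        answer = judge(prompt).strip().upper()[:1]  # expect 'A' or 'B'
        votes[answer] += 1
    decided = votes["A"] + votes["B"]
    return votes["A"] / decided if decided else 0.5
```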
Authors:
Wen, Jiaxin; Zhong, Ruiqi; Khan, Akbir; Perez, Ethan; Steinhardt, Jacob; Huang, Minlie; Bowman, Samuel R.; He, He; Feng, Shi
Language models (LMs) can produce errors that are hard for humans to detect, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing …
External link:
http://arxiv.org/abs/2409.12822
To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional …
External link:
http://arxiv.org/abs/2409.08466
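The entry above describes the common workflow of clustering text embeddings and then interpreting each cluster's mean parameter. Below is a minimal sketch of that baseline workflow, not the paper's proposed method; it uses TF-IDF features and k-means purely so the example runs without an external embedding model, and it summarizes each (hard-to-read) centroid by its nearest member texts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the pasta was overcooked and bland",
    "great service and friendly staff",
    "waited 40 minutes for a cold burger",
    "our waiter was attentive and kind",
    "delicious ramen, rich broth",
    "rude host, never coming back",
]

# Embed and cluster; swap in neural sentence embeddings for real use.
X = TfidfVectorizer().fit_transform(texts).toarray()
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for k, center in enumerate(km.cluster_centers_):
    # The centroid itself is a high-dimensional vector that is hard to
    # interpret directly, so show the texts closest to it instead.
    nearest = np.argsort(np.linalg.norm(X - center, axis=1))[:2]
    print(f"cluster {k}:", [texts[i] for i in nearest])
```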
Emerging marketplaces for large language models and other large-scale machine learning (ML) models appear to exhibit market concentration, which has raised concerns about whether there are insurmountable barriers to entry in such markets. In this work …
External link:
http://arxiv.org/abs/2409.03734
Authors:
Halawi, Danny; Wei, Alexander; Wallace, Eric; Wang, Tony T.; Haghtalab, Nika; Steinhardt, Jacob
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we …
External link:
http://arxiv.org/abs/2406.20053
Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize …
External link:
http://arxiv.org/abs/2406.19501
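The entry above concerns reading properties off a language model's internal states. The sketch below shows the generic linear-probing idea on synthetic "hidden states" (random vectors standing in for real activations), not the paper's specific technique; in practice the features would come from a chosen layer of the model's forward pass.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 256
labels = rng.integers(0, 2, size=n)            # does the property of interest hold?
signal = np.outer(labels, rng.normal(size=d))  # the property leaves a linear trace
hidden = signal + rng.normal(size=(n, d))      # plus unrelated structure / noise

X_train, X_test, y_train, y_test = train_test_split(hidden, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy on held-out states:", probe.score(X_test, y_test))
```

A probe like this can then be run on new hidden states at inference time to flag responses whose internal representation disagrees with what the model says.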
Developers try to evaluate whether an AI system can be misused by adversaries before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing …
External link:
http://arxiv.org/abs/2406.14595
We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e., the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails …
External link:
http://arxiv.org/abs/2406.04341
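The entry above contrasts a neuron's direct effect (its write to the residual stream mapped straight to the output) with its overall contribution. The NumPy toy below illustrates that distinction in a made-up two-block residual network with random weights; it is not CLIP and not the paper's analysis, just the two quantities side by side.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, v = 16, 32, 10  # residual-stream width, MLP width, output dimension

def relu(x):
    return np.maximum(x, 0)

W1_in, W1_out = rng.normal(size=(m, d)), rng.normal(size=(d, m)) / m
W2_in, W2_out = rng.normal(size=(m, d)), rng.normal(size=(d, m)) / m
W_U = rng.normal(size=(v, d))  # final map from the residual stream to the output

def forward(x, ablate=None):
    a1 = relu(W1_in @ x)
    if ablate is not None:
        a1 = a1.copy()
        a1[ablate] = 0.0               # zero out one block-1 neuron
    h = x + W1_out @ a1                # block 1 writes into the residual stream
    h = h + W2_out @ relu(W2_in @ h)   # block 2 reads that stream
    return W_U @ h

x = rng.normal(size=d)
j = 3                                  # the neuron under study
a_j = relu(W1_in @ x)[j]

direct = W_U @ (W1_out[:, j] * a_j)        # flow straight to the output
total = forward(x) - forward(x, ablate=j)  # also includes block-2 mediation
print("direct and total effects differ:", not np.allclose(direct, total))
```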
Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed …
External link:
http://arxiv.org/abs/2402.18563
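The entry above describes a retrieval-augmented LM system for forecasting. The sketch below shows one plausible shape of such a pipeline under stated assumptions: `search_news` and `ask_model` are hypothetical stand-ins for a news-retrieval API and an LLM call, and taking the median over sampled rationales is just one reasonable aggregation choice, not necessarily the paper's.

```python
import re
import statistics
from typing import Callable, List

def forecast(
    question: str,
    search_news: Callable[[str], List[str]],
    ask_model: Callable[[str], str],
    n_samples: int = 5,
) -> float:
    """Return a probability in [0, 1] for a binary forecasting question."""
    articles = search_news(question)[:10]          # retrieve relevant context
    prompt = (
        f"Question: {question}\n\n"
        "Relevant articles:\n" + "\n\n".join(articles) + "\n\n"
        "Reason step by step, then end with a line of the form "
        "'Probability: X' where X is a number between 0 and 1."
    )
    estimates = []
    for _ in range(n_samples):                     # sample several rationales
        reply = ask_model(prompt)
        match = re.search(r"Probability:\s*([01](?:\.\d+)?)", reply)
        if match:
            estimates.append(float(match.group(1)))
    return statistics.median(estimates) if estimates else 0.5
```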
Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. These interactions form feedback loops: LLM outputs affect the …
External link:
http://arxiv.org/abs/2402.06627
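The entry above describes feedback loops in which model outputs change the world and the changed world shapes later outputs. The sketch below is a generic simulation harness for such a loop, with `ask_model` and `apply_to_world` as hypothetical stand-ins for the model call and for whatever the output touches (a web page, a user, a shell); it is meant only to make the loop structure concrete.

```python
from typing import Callable, List

def run_feedback_loop(
    task: str,
    ask_model: Callable[[str], str],
    apply_to_world: Callable[[str, str], str],
    steps: int = 5,
) -> List[str]:
    """Return the trajectory of world states over `steps` model-world interactions."""
    state = "initial state"
    trajectory = []
    for _ in range(steps):
        prompt = (
            f"Task: {task}\n"
            f"Current state of the world:\n{state}\n"
            "Your action:"
        )
        action = ask_model(prompt)             # the output is shaped by the state...
        state = apply_to_world(state, action)  # ...and it reshapes that state,
        trajectory.append(state)               # which feeds into the next prompt
    return trajectory
```

Inspecting such a trajectory, rather than single responses in isolation, is where loop-driven behavior would show up.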