Showing 1 - 10 of 178 results for search: '"STEINHARDT, JACOB"'
Large language models (LLMs) often exhibit subtle yet distinctive characteristics in their outputs that users intuitively recognize, but struggle to quantify. These "vibes" -- such as tone, formatting, or writing style -- influence user preferences, …
External link:
http://arxiv.org/abs/2410.12851
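The entry above is about quantifying qualitative output differences. As a rough illustration only (not the paper's method), one way to turn a "vibe" into a number is to ask a judge model which of two outputs exhibits it more and count the votes; the `judge` callable below is a hypothetical stand-in for whatever LLM API is available.

```python
from collections import Counter
from typing import Callable, Iterable, Tuple

def vibe_preference_rate(
    vibe: str,
    pairs: Iterable[Tuple[str, str]],
    judge: Callable[[str], str],
) -> float:
    """Fraction of (model-A output, model-B output) pairs for which the judge
    says A's output exhibits `vibe` more strongly; a rate far from 0.5 means
    the vibe reliably separates the two models."""
    votes = Counter()
    for out_a, out_b in pairs:
        prompt = (
            f"Which response exhibits this trait more: {vibe}?\n\n"
            f"Response A:\n{out_a}\n\nResponse B:\n{out_b}\n\n"
            "Answer with exactly 'A' or 'B'."
        )
        answer = judge(prompt).strip().upper()[:1]  # expect 'A' or 'B'
        votes[answer] += 1
    decided = votes["A"] + votes["B"]
    return votes["A"] / decided if decided else 0.5
```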
Authors:
Wen, Jiaxin; Zhong, Ruiqi; Khan, Akbir; Perez, Ethan; Steinhardt, Jacob; Huang, Minlie; Bowman, Samuel R.; He, He; Feng, Shi
Language models (LMs) can produce errors that are hard for humans to detect, especially when the task is complex. RLHF, the most popular post-training method, may exacerbate this problem: to achieve higher rewards, LMs might get better at convincing …
External link:
http://arxiv.org/abs/2409.12822
To make sense of massive data, we often fit simplified models and then interpret the parameters; for example, we cluster the text embeddings and then interpret the mean parameters of each cluster. However, these parameters are often high-dimensional …
External link:
http://arxiv.org/abs/2409.08466
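The entry above describes the common workflow of clustering text embeddings and then interpreting each cluster's mean parameter. Below is a minimal sketch of that baseline workflow, not the paper's proposed method; it uses TF-IDF features and k-means purely so the example runs without an external embedding model, and it summarizes each (hard-to-read) centroid by its nearest member texts.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

texts = [
    "the pasta was overcooked and bland",
    "great service and friendly staff",
    "waited 40 minutes for a cold burger",
    "our waiter was attentive and kind",
    "delicious ramen, rich broth",
    "rude host, never coming back",
]

# Embed and cluster; swap in neural sentence embeddings for real use.
X = TfidfVectorizer().fit_transform(texts).toarray()
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for k, center in enumerate(km.cluster_centers_):
    # The centroid itself is a high-dimensional vector that is hard to
    # interpret directly, so show the texts closest to it instead.
    nearest = np.argsort(np.linalg.norm(X - center, axis=1))[:2]
    print(f"cluster {k}:", [texts[i] for i in nearest])
```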
Emerging marketplaces for large language models and other large-scale machine learning (ML) models appear to exhibit market concentration, which has raised concerns about whether there are insurmountable barriers to entry in such markets. In this work …
External link:
http://arxiv.org/abs/2409.03734
Authors:
Halawi, Danny; Wei, Alexander; Wallace, Eric; Wang, Tony T.; Haghtalab, Nika; Steinhardt, Jacob
Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we …
External link:
http://arxiv.org/abs/2406.20053
Language models are susceptible to bias, sycophancy, backdoors, and other tendencies that lead to unfaithful responses to the input context. Interpreting internal states of language models could help monitor and correct unfaithful behavior. We hypothesize …
External link:
http://arxiv.org/abs/2406.19501
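The entry above concerns reading properties off a language model's internal states. The sketch below shows the generic linear-probing idea on synthetic "hidden states" (random vectors standing in for real activations), not the paper's specific technique; in practice the features would come from a chosen layer of the model's forward pass.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, d = 2000, 256
labels = rng.integers(0, 2, size=n)            # does the property of interest hold?
signal = np.outer(labels, rng.normal(size=d))  # the property leaves a linear trace
hidden = signal + rng.normal(size=(n, d))      # plus unrelated structure / noise

X_train, X_test, y_train, y_test = train_test_split(hidden, labels, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("probe accuracy on held-out states:", probe.score(X_test, y_test))
```

A probe like this can then be run on new hidden states at inference time to flag responses whose internal representation disagrees with what the model says.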
Developers try to evaluate whether an AI system can be misused by adversaries before releasing it; for example, they might test whether a model enables cyberoffense, user manipulation, or bioterrorism. In this work, we show that individually testing …
External link:
http://arxiv.org/abs/2406.14595
We interpret the function of individual neurons in CLIP by automatically describing them using text. Analyzing the direct effects (i.e., the flow from a neuron through the residual stream to the output) or the indirect effects (overall contribution) fails …
External link:
http://arxiv.org/abs/2406.04341
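The entry above contrasts a neuron's direct effect (its write to the residual stream mapped straight to the output) with its overall contribution. The NumPy toy below illustrates that distinction in a made-up two-block residual network with random weights; it is not CLIP and not the paper's analysis, just the two quantities side by side.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, v = 16, 32, 10  # residual-stream width, MLP width, output dimension

def relu(x):
    return np.maximum(x, 0)

W1_in, W1_out = rng.normal(size=(m, d)), rng.normal(size=(d, m)) / m
W2_in, W2_out = rng.normal(size=(m, d)), rng.normal(size=(d, m)) / m
W_U = rng.normal(size=(v, d))  # final map from the residual stream to the output

def forward(x, ablate=None):
    a1 = relu(W1_in @ x)
    if ablate is not None:
        a1 = a1.copy()
        a1[ablate] = 0.0               # zero out one block-1 neuron
    h = x + W1_out @ a1                # block 1 writes into the residual stream
    h = h + W2_out @ relu(W2_in @ h)   # block 2 reads that stream
    return W_U @ h

x = rng.normal(size=d)
j = 3                                  # the neuron under study
a_j = relu(W1_in @ x)[j]

direct = W_U @ (W1_out[:, j] * a_j)        # flow straight to the output
total = forward(x) - forward(x, ablate=j)  # also includes block-2 mediation
print("direct and total effects differ:", not np.allclose(direct, total))
```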
Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed …
External link:
http://arxiv.org/abs/2402.18563
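The entry above describes a retrieval-augmented LM system for forecasting. The sketch below shows one plausible shape of such a pipeline under stated assumptions: `search_news` and `ask_model` are hypothetical stand-ins for a news-retrieval API and an LLM call, and taking the median over sampled rationales is just one reasonable aggregation choice, not necessarily the paper's.

```python
import re
import statistics
from typing import Callable, List

def forecast(
    question: str,
    search_news: Callable[[str], List[str]],
    ask_model: Callable[[str], str],
    n_samples: int = 5,
) -> float:
    """Return a probability in [0, 1] for a binary forecasting question."""
    articles = search_news(question)[:10]          # retrieve relevant context
    prompt = (
        f"Question: {question}\n\n"
        "Relevant articles:\n" + "\n\n".join(articles) + "\n\n"
        "Reason step by step, then end with a line of the form "
        "'Probability: X' where X is a number between 0 and 1."
    )
    estimates = []
    for _ in range(n_samples):                     # sample several rationales
        reply = ask_model(prompt)
        match = re.search(r"Probability:\s*([01](?:\.\d+)?)", reply)
        if match:
            estimates.append(float(match.group(1)))
    return statistics.median(estimates) if estimates else 0.5
```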
Language models influence the external world: they query APIs that read and write to web pages, generate content that shapes human behavior, and run system commands as autonomous agents. These interactions form feedback loops: LLM outputs affect the …
External link:
http://arxiv.org/abs/2402.06627
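The entry above describes feedback loops in which model outputs change the world and the changed world shapes later outputs. The sketch below is a generic simulation harness for such a loop, with `ask_model` and `apply_to_world` as hypothetical stand-ins for the model call and for whatever the output touches (a web page, a user, a shell); it is meant only to make the loop structure concrete.

```python
from typing import Callable, List

def run_feedback_loop(
    task: str,
    ask_model: Callable[[str], str],
    apply_to_world: Callable[[str, str], str],
    steps: int = 5,
) -> List[str]:
    """Return the trajectory of world states over `steps` model-world interactions."""
    state = "initial state"
    trajectory = []
    for _ in range(steps):
        prompt = (
            f"Task: {task}\n"
            f"Current state of the world:\n{state}\n"
            "Your action:"
        )
        action = ask_model(prompt)             # the output is shaped by the state...
        state = apply_to_world(state, action)  # ...and it reshapes that state,
        trajectory.append(state)               # which feeds into the next prompt
    return trajectory
```

Inspecting such a trajectory, rather than single responses in isolation, is where loop-driven behavior would show up.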