Search results - "Wang, Tony T."

Report

Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation

Authors: Halawi, Danny; Wei, Alexander; Wallace, Eric; Wang, Tony T.; Haghtalab, Nika; Steinhardt, Jacob

Black-box finetuning is an emerging interface for adapting state-of-the-art language models to user needs. However, such access may also let malicious actors undermine model safety. To demonstrate the challenge of defending finetuning interfaces, we

External link: http://arxiv.org/abs/2406.20053

Report

Can Go AIs be adversarially robust?

Authors: Tseng, Tom; McLean, Euan; Pelrine, Kellin; Wang, Tony T.; Gleave, Adam

Prior work found that superhuman Go AIs can be defeated by simple adversarial strategies, especially "cyclic" attacks. In this paper, we study whether adding natural countermeasures can achieve robustness in Go, a favorable domain for robustness sinc

External link: http://arxiv.org/abs/2406.12843

Report

Forbidden Facts: An Investigation of Competing Objectives in Llama-2

Authors: Wang, Tony T.; Wang, Miles; Hariharan, Kaivalya; Shavit, Nir

LLMs often face competing pressures (for example helpfulness vs. harmlessness). To understand how models resolve such conflicts, we study Llama-2-chat models on the forbidden fact task. Specifically, we instruct Llama-2 to truthfully complete a factu

External link: http://arxiv.org/abs/2312.08793

Report

Cliff-Learning

Authors: Wang, Tony T.; Zablotchi, Igor; Shavit, Nir; Rosenfeld, Jonathan S.

We study the data-scaling of transfer learning from foundation models in the low-downstream-data regime. We observe an intriguing phenomenon which we call cliff-learning. Cliff-learning refers to regions of data-scaling laws where performance improve

External link: http://arxiv.org/abs/2302.07348

Report

Adversarial Policies Beat Superhuman Go AIs

Authors: Wang, Tony T.; Gleave, Adam; Tseng, Tom; Pelrine, Kellin; Belrose, Nora; Miller, Joseph; Dennis, Michael D.; Duan, Yawen; Pogrebniak, Viktor; Levine, Sergey; Russell, Stuart

We attack the state-of-the-art Go-playing AI system KataGo by training adversarial policies against it, achieving a >97% win rate against KataGo running at superhuman settings. Our adversaries do not win by playing Go well. Instead, they trick KataGo

External link: http://arxiv.org/abs/2211.00241

Dissertation/Thesis

Adversarial Examples in Simpler Settings

Author: Wang, Tony T.

In this thesis we explore adversarial examples for simple model families and simple data distributions, focusing in particular on linear and kernel classifiers. On the theoretical front we find evidence that natural accuracy and robust accuracy are m

External link: https://hdl.handle.net/1721.1/139041
