Showing 1 - 10 of 386 for search: '"DELALLEAU, A."'
Author:
Zhang, Michael JQ, Wang, Zhilin, Hwang, Jena D., Dong, Yi, Delalleau, Olivier, Choi, Yejin, Choi, Eunsol, Ren, Xiang, Pyatkin, Valentina
We examine diverging preferences in human-labeled preference datasets. We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes -- task underspecification, response style, refusals, and annotation errors. …
External link:
http://arxiv.org/abs/2410.14632
Author:
Wang, Zhilin, Bukharin, Alexander, Delalleau, Olivier, Egert, Daniel, Shen, Gerald, Zeng, Jiaqi, Kuchaiev, Oleksii, Dong, Yi
Reward models are critical for aligning models to follow instructions, and are typically trained following one of two popular paradigms: Bradley-Terry style or Regression style. However, there is a lack of evidence that either approach is better than …
External link:
http://arxiv.org/abs/2410.01257
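For context, here is a minimal sketch (my own assumptions, not code from the paper) of the two reward-model training objectives the snippet above names, with random tensors standing in for a reward head's outputs:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Pairwise Bradley-Terry objective: maximize the log-probability that the
    # chosen response outscores the rejected one for the same prompt.
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def regression_loss(pred_ratings: torch.Tensor, target_ratings: torch.Tensor) -> torch.Tensor:
    # Regression-style objective: fit human-provided scalar ratings directly.
    return F.mse_loss(pred_ratings, target_ratings)

# Toy usage with random scores in place of a trained reward head.
r_chosen, r_rejected = torch.randn(8), torch.randn(8)
pred, target = torch.randn(8), torch.rand(8) * 4  # hypothetical ratings on a 0-4 scale
print(bradley_terry_loss(r_chosen, r_rejected).item())
print(regression_loss(pred, target).item())
```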
Author:
Nvidia, Adler, Bo, Agarwal, Niket, Aithal, Ashwath, Anh, Dong H., Bhattacharya, Pallab, Brundyn, Annika, Casper, Jared, Catanzaro, Bryan, Clay, Sharon, Cohen, Jonathan, Das, Sirshak, Dattagupta, Ayush, Delalleau, Olivier, Derczynski, Leon, Dong, Yi, Egert, Daniel, Evans, Ellie, Ficek, Aleksander, Fridman, Denys, Ghosh, Shaona, Ginsburg, Boris, Gitman, Igor, Grzegorzek, Tomasz, Hero, Robert, Huang, Jining, Jawa, Vibhu, Jennings, Joseph, Jhunjhunwala, Aastha, Kamalu, John, Khan, Sadaf, Kuchaiev, Oleksii, LeGresley, Patrick, Li, Hui, Liu, Jiwei, Liu, Zihan, Long, Eileen, Mahabaleshwarkar, Ameya Sunil, Majumdar, Somshubra, Maki, James, Martinez, Miguel, de Melo, Maer Rodrigues, Moshkov, Ivan, Narayanan, Deepak, Narenthiran, Sean, Navarro, Jesus, Nguyen, Phong, Nitski, Osvald, Noroozi, Vahid, Nutheti, Guruprasad, Parisien, Christopher, Parmar, Jupinder, Patwary, Mostofa, Pawelec, Krzysztof, Ping, Wei, Prabhumoye, Shrimai, Roy, Rajarshi, Saar, Trisha, Sabavat, Vasanth Rao Naik, Satheesh, Sanjeev, Scowcroft, Jane Polak, Sewall, Jason, Shamis, Pavel, Shen, Gerald, Shoeybi, Mohammad, Sizer, Dave, Smelyanskiy, Misha, Soares, Felipe, Sreedhar, Makesh Narsimhan, Su, Dan, Subramanian, Sandeep, Sun, Shengyang, Toshniwal, Shubham, Wang, Hao, Wang, Zhilin, You, Jiaxuan, Zeng, Jiaqi, Zhang, Jimmy, Zhang, Jing, Zhang, Vivienne, Zhang, Yian, Zhu, Chen
We release the Nemotron-4 340B model family, including Nemotron-4-340B-Base, Nemotron-4-340B-Instruct, and Nemotron-4-340B-Reward. Our models are open access under the NVIDIA Open Model License Agreement, a permissive model license that allows distribution …
External link:
http://arxiv.org/abs/2406.11704
Author:
Wang, Zhilin, Dong, Yi, Delalleau, Olivier, Zeng, Jiaqi, Shen, Gerald, Egert, Daniel, Zhang, Jimmy J., Sreedhar, Makesh Narsimhan, Kuchaiev, Oleksii
High-quality preference datasets are essential for training reward models that can effectively guide large language models (LLMs) in generating high-quality responses aligned with human preferences. As LLMs become stronger and better aligned, …
External link:
http://arxiv.org/abs/2406.08673
Author:
Shen, Gerald, Wang, Zhilin, Delalleau, Olivier, Zeng, Jiaqi, Dong, Yi, Egert, Daniel, Sun, Shengyang, Zhang, Jimmy, Jain, Sahil, Taghibakhshi, Ali, Ausin, Markel Sanz, Aithal, Ashwath, Kuchaiev, Oleksii
Aligning Large Language Models (LLMs) with human values and preferences is essential for making them helpful and safe. However, building efficient tools to perform alignment can be challenging, especially for the largest and most competent LLMs, which …
External link:
http://arxiv.org/abs/2405.01481
Author:
Wang, Zhilin, Dong, Yi, Zeng, Jiaqi, Adams, Virginia, Sreedhar, Makesh Narsimhan, Egert, Daniel, Delalleau, Olivier, Scowcroft, Jane Polak, Kant, Neel, Swope, Aidan, Kuchaiev, Oleksii
Existing open-source helpfulness preference datasets do not specify what makes some responses more helpful and others less so. Models trained on these datasets can incidentally learn to model dataset artifacts (e.g. preferring longer but unhelpful responses) …
External link:
http://arxiv.org/abs/2311.09528
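To make the length-artifact concern above concrete, a small sketch (hypothetical pairs, not the released dataset) that probes how often the chosen response in a pairwise preference set is simply the longer one:

```python
# Hypothetical preference pairs; in practice these would be loaded from a dataset.
pairs = [
    {"chosen": "A detailed, step-by-step explanation of the fix.", "rejected": "Try restarting."},
    {"chosen": "No.", "rejected": "A long but off-topic ramble that never answers the question."},
]

# Count how often the preferred response is also the longer one -- a surface
# artifact a reward model could latch onto instead of actual helpfulness.
longer_chosen = sum(len(p["chosen"]) > len(p["rejected"]) for p in pairs)
print(f"Chosen response is longer in {longer_chosen}/{len(pairs)} pairs")
```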
Author:
Chitnis, Rohan, Xu, Yingchen, Hashemi, Bobak, Lehnert, Lucas, Dogan, Urun, Zhu, Zheqing, Delalleau, Olivier
Published in:
Short version published at ICRA 2024 (https://tinyurl.com/icra24-iqltdmpc)
Model-based reinforcement learning (RL) has shown great promise due to its sample efficiency, but still struggles with long-horizon sparse-reward tasks, especially in offline settings where the agent learns from a fixed dataset. We hypothesize that …
External link:
http://arxiv.org/abs/2306.00867
Author:
Sodhani, Shagun, Delalleau, Olivier, Assran, Mahmoud, Sinha, Koustuv, Ballas, Nicolas, Rabbat, Michael
Codistillation has been proposed as a mechanism to share knowledge among concurrently trained models by encouraging them to represent the same function through an auxiliary loss. This contrasts with the more commonly used fully-synchronous data-parallel …
External link:
http://arxiv.org/abs/2010.02838
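The snippet above describes codistillation's auxiliary loss only in words; a minimal sketch of one such objective (my own assumptions, not the paper's implementation) could look as follows, with random logits standing in for two concurrently trained workers:

```python
import torch
import torch.nn.functional as F

def codistillation_loss(logits: torch.Tensor,
                        labels: torch.Tensor,
                        peer_logits: torch.Tensor,
                        alpha: float = 0.5) -> torch.Tensor:
    # Task cross-entropy plus a KL auxiliary term pulling this worker's
    # predictive distribution toward a peer's (detached, periodically
    # exchanged) predictions, so both models learn similar functions.
    task = F.cross_entropy(logits, labels)
    aux = F.kl_div(F.log_softmax(logits, dim=-1),
                   F.softmax(peer_logits.detach(), dim=-1),
                   reduction="batchmean")
    return task + alpha * aux

# Toy usage: random logits for a 10-class problem on a batch of 16 examples.
logits = torch.randn(16, 10, requires_grad=True)
peer_logits = torch.randn(16, 10)
labels = torch.randint(0, 10, (16,))
codistillation_loss(logits, labels, peer_logits).backward()
```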
Academic article
This result cannot be displayed to unauthenticated users. You must log in to view it.
While most current research in Reinforcement Learning (RL) focuses on improving the performance of the algorithms in controlled environments, the use of RL under constraints like those met in the video game industry is rarely studied. Operating under …
External link:
http://arxiv.org/abs/1912.11077