Showing 1 - 10 of 323 for search: '"Zhang, Yushun"'
Recently, large language models (LLMs) have demonstrated remarkable capabilities in a wide range of tasks. Typically, an LLM is pre-trained on large corpora and subsequently fine-tuned on task-specific datasets. However, during fine-tuning, LLMs may …
External link:
http://arxiv.org/abs/2407.20999
Author:
Zhang, Yushun, Chen, Congliang, Li, Ziniu, Ding, Tian, Wu, Chenwei, Ye, Yinyu, Luo, Zhi-Quan, Sun, Ruoyu
We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., $1/\sqrt{v}$). We find that $\geq$ 90% …
External link:
http://arxiv.org/abs/2406.16793
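The Adam-mini snippet above describes shrinking Adam's per-parameter learning-rate state $1/\sqrt{v}$ to far fewer values. A minimal sketch of that idea, assuming a blockwise-averaged second moment (the function names and per-tensor block granularity are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def adam_step(p, g, m, v, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Standard Adam: the second moment v has the same shape as p."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    p = p - lr * m / (np.sqrt(v) + eps)
    return p, m, v

def adam_mini_step(p, g, m, v_scalar, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """Sketch of the Adam-mini idea: keep one scalar v per parameter
    block (here, the whole tensor) instead of one entry per parameter."""
    m = b1 * m + (1 - b1) * g
    # Average the squared gradient within the block before updating v.
    v_scalar = b2 * v_scalar + (1 - b2) * np.mean(g * g)
    p = p - lr * m / (np.sqrt(v_scalar) + eps)
    return p, m, v_scalar
```

Keeping one scalar per block replaces a tensor-sized `v` with a single number, which is where the memory saving quoted in the abstract would come from.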
SGD performs worse than Adam by a significant margin on Transformers, but the reason remains unclear. In this work, we provide an explanation through the lens of Hessian: (i) Transformers are "heterogeneous": the Hessian spectrum across parameter blocks …
External link:
http://arxiv.org/abs/2402.16788
Reinforcement Learning from Human Feedback (RLHF) is key to aligning Large Language Models (LLMs), typically paired with the Proximal Policy Optimization (PPO) algorithm. While PPO is a powerful method designed for general reinforcement learning tasks …
External link:
http://arxiv.org/abs/2310.10505
Logs are valuable information for oil and gas fields as they help to determine the lithology of the formations surrounding the borehole and the location and reserves of subsurface oil and gas reservoirs. However, important logs are often missing in h…
External link:
http://arxiv.org/abs/2308.12625
Modern neural networks are often quite wide, causing large memory and computation costs. It is thus of great interest to train a narrower network. However, training narrow neural nets remains a challenging task. We ask two theoretical questions: Can …
External link:
http://arxiv.org/abs/2210.12001
Author:
Wang, Bohan, Zhang, Yushun, Zhang, Huishuai, Meng, Qi, Sun, Ruoyu, Ma, Zhi-Ming, Liu, Tie-Yan, Luo, Zhi-Quan, Chen, Wei
Adam is widely adopted in practical applications due to its fast convergence. However, its theoretical analysis is still far from satisfactory. Existing convergence analyses for Adam rely on the bounded smoothness assumption, referred to as the …
External link:
http://arxiv.org/abs/2208.09900
Ever since Reddi et al. 2018 pointed out the divergence issue of Adam, many new variants have been designed to obtain convergence. However, vanilla Adam remains exceptionally popular and it works well in practice. Why is there a gap between theory and …
External link:
http://arxiv.org/abs/2208.09632
Published in:
In Heliyon, 15 July 2024, 10(13)
Author:
Zhang, Yushun, Liu, Jian, Qiu, Xinqiang, Li, Wenfeng, Yang, Haochen, Qin, Haixia, Wang, Yanping, Wang, Min, Zhu, Hengkang
Published in:
In Heliyon, 15 April 2024, 10(7)