Showing 1 - 9 of 9 for search: '"Paul, Mansheej"'
Traditionally, reward models used for reinforcement learning from human feedback (RLHF) are trained to directly predict preference scores without leveraging the generation capabilities of the underlying large language model (LLM). This limits the cap
External link:
http://arxiv.org/abs/2408.11791
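For context on the snippet above, which contrasts generative reward modeling with the standard approach, here is a minimal sketch of the conventional preference-score setup it refers to (a Bradley-Terry style pairwise loss); this illustrates the baseline, not the paper's proposed method, and the tensor names are assumptions.

```python
# Sketch of a conventional pairwise reward-model loss (Bradley-Terry style).
# Illustrative only; shapes and names are hypothetical.
import torch
import torch.nn.functional as F

def pairwise_reward_loss(reward_chosen: torch.Tensor,
                         reward_rejected: torch.Tensor) -> torch.Tensor:
    """reward_chosen / reward_rejected: scalar scores per preference pair, shape (batch,)."""
    # Maximize the log-probability that the chosen response outscores the rejected one.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Usage with dummy reward scores standing in for a reward head's outputs.
scores_chosen = torch.randn(8, requires_grad=True)
scores_rejected = torch.randn(8, requires_grad=True)
loss = pairwise_reward_loss(scores_chosen, scores_rejected)
loss.backward()
```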
Pretraining datasets for large language models (LLMs) have grown to trillions of tokens composed of large amounts of CommonCrawl (CC) web scrape along with smaller, domain-specific datasets. It is expensive to understand the impact of these domain-sp
External link:
http://arxiv.org/abs/2406.03476
Author:
Ankner, Zachary, Blakeney, Cody, Sreenivasan, Kartik, Marion, Max, Leavitt, Matthew L., Paul, Mansheej
In this work, we investigate whether small language models can determine high-quality subsets of large-scale text datasets that improve the performance of larger language models. While existing work has shown that pruning based on the perplexity of a
External link:
http://arxiv.org/abs/2405.20541
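As a rough illustration of the general idea in the snippet above (scoring documents with a small reference model and keeping a subset), here is a hedged sketch; the scoring callable, keep fraction, and selection rule are assumptions, not the paper's exact recipe.

```python
# Sketch: prune a text corpus using perplexity scores from a small reference model.
# `small_lm_perplexity` is a stand-in callable (assumption) returning one document's
# perplexity; the "keep lowest scores" rule is just one of several possible choices.
from typing import Callable, List

def prune_by_perplexity(docs: List[str],
                        small_lm_perplexity: Callable[[str], float],
                        keep_fraction: float = 0.5) -> List[str]:
    # Score every document with the small model, then keep the lowest-scoring slice.
    scored = sorted(docs, key=small_lm_perplexity)
    n_keep = max(1, int(len(scored) * keep_fraction))
    return scored[:n_keep]

# Example with a toy scorer that uses token count as a stand-in for perplexity.
toy_scorer = lambda doc: float(len(doc.split()))
subset = prune_by_perplexity(["a b c", "a", "a b"], toy_scorer, keep_fraction=0.67)
```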
Author:
Biderman, Dan, Portes, Jacob, Ortiz, Jose Javier Gonzalez, Paul, Mansheej, Greengard, Philip, Jennings, Connor, King, Daniel, Havens, Sam, Chiley, Vitaliy, Frankle, Jonathan, Blakeney, Cody, Cunningham, John P.
Low-Rank Adaptation (LoRA) is a widely-used parameter-efficient finetuning method for large language models. LoRA saves memory by training only low rank perturbations to selected weight matrices. In this work, we compare the performance of LoRA and f
External link:
http://arxiv.org/abs/2405.09673
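The snippet above describes the core LoRA mechanism (training only a low-rank perturbation to a frozen weight matrix); a minimal sketch of that idea follows, written as a generic illustration with assumed rank and scaling hyperparameters, not as the paper's experimental setup.

```python
# Minimal LoRA-style linear layer: the pretrained weight W is frozen and only the
# low-rank factors A and B are trained, so the effective weight is W + (alpha/r) * B @ A.
# Rank and scaling values are illustrative assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features), requires_grad=False)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)   # trainable
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))         # trainable, zero init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.T
        update = (x @ self.lora_A.T) @ self.lora_B.T
        return base + self.scaling * update

layer = LoRALinear(512, 512, rank=8)
y = layer(torch.randn(4, 512))  # only lora_A and lora_B receive gradients
```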
Pretrained transformers exhibit the remarkable ability of in-context learning (ICL): they can learn tasks from just a few examples provided in the prompt without updating any weights. This raises a foundational question: can ICL solve fundamentally $
External link:
http://arxiv.org/abs/2306.15063
Author:
Paul, Mansheej, Chen, Feng, Larsen, Brett W., Frankle, Jonathan, Ganguli, Surya, Dziugaite, Gintare Karolina
Modern deep learning involves training costly, highly overparameterized networks, thus motivating the search for sparser networks that can still be trained to the same accuracy as the full network (i.e. matching). Iterative magnitude pruning (IMP) is
External link:
http://arxiv.org/abs/2210.03044
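The lottery-ticket entries here build on iterative magnitude pruning; below is a hedged sketch of the standard IMP-with-rewinding loop, using hypothetical helper functions and illustrative settings rather than the papers' exact protocols.

```python
# Sketch of iterative magnitude pruning (IMP) with weight rewinding.
# `train`, `magnitude_prune`, and `rewind_unpruned_weights` are hypothetical helpers;
# the prune fraction and number of rounds are illustrative.
import copy

def iterative_magnitude_pruning(model, train, magnitude_prune, rewind_unpruned_weights,
                                rounds: int = 5, prune_fraction: float = 0.2):
    # Snapshot the early-training weights that pruned-and-rewound networks restart from.
    rewind_state = copy.deepcopy(model.state_dict())
    mask = None
    for _ in range(rounds):
        train(model, mask)                                   # train under the current sparsity mask
        mask = magnitude_prune(model, mask, prune_fraction)  # drop the smallest-magnitude weights
        rewind_unpruned_weights(model, rewind_state, mask)   # rewind survivors to the snapshot
    return model, mask
```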
Author:
Paul, Mansheej, Larsen, Brett W., Ganguli, Surya, Frankle, Jonathan, Dziugaite, Gintare Karolina
A striking observation about iterative magnitude pruning (IMP; Frankle et al. 2020) is that, after just a few hundred steps of dense training, the method can find a sparse sub-network that can be trained to the same
External link:
http://arxiv.org/abs/2206.01278
Published in:
Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
Recent success in deep learning has partially been driven by training increasingly overparametrized networks on ever larger datasets. It is therefore natural to ask: how much of the data is superfluous, which examples are important for generalization
External link:
http://arxiv.org/abs/2107.07075
Author:
Fort, Stanislav, Dziugaite, Gintare Karolina, Paul, Mansheej, Kharaghani, Sepideh, Roy, Daniel M., Ganguli, Surya
In suitably initialized wide networks, small learning rates transform deep neural networks (DNNs) into neural tangent kernel (NTK) machines, whose training dynamics is well-approximated by a linear weight expansion of the network at initialization. S
External link:
http://arxiv.org/abs/2010.15110
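The last snippet refers to the linear (NTK) approximation of training dynamics; for reference, the standard first-order expansion of a network around its initialization and the associated empirical kernel, in textbook form rather than as a result specific to this paper:

```latex
% Linearized network around initialization \theta_0 (standard NTK setting):
f_{\mathrm{lin}}(x;\theta) \;=\; f(x;\theta_0) \;+\; \nabla_\theta f(x;\theta_0)^{\top}\,(\theta - \theta_0),
\qquad
\Theta(x, x') \;=\; \nabla_\theta f(x;\theta_0)^{\top}\,\nabla_\theta f(x';\theta_0).
```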