Showing 1 - 10 of 802 for search: '"Bhagia A"'
Author:
Muennighoff, Niklas, Soldaini, Luca, Groeneveld, Dirk, Lo, Kyle, Morrison, Jacob, Min, Sewon, Shi, Weijia, Walsh, Pete, Tafjord, Oyvind, Lambert, Nathan, Gu, Yuling, Arora, Shane, Bhagia, Akshita, Schwenk, Dustin, Wadden, David, Wettig, Alexander, Hui, Binyuan, Dettmers, Tim, Kiela, Douwe, Farhadi, Ali, Smith, Noah A., Koh, Pang Wei, Singh, Amanpreet, Hajishirzi, Hannaneh
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create… (A minimal sketch of sparse expert routing follows the link below.)
External link:
http://arxiv.org/abs/2409.02060
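The core mechanism behind the 7B-total / 1B-active split is sparse expert routing: each token is sent to only the top-k of many expert feed-forward networks, so active parameters per token stay far below the total. Below is a minimal, illustrative PyTorch sketch; the layer sizes, gating scheme, and class names are assumptions for exposition, not OLMoE's actual implementation.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Toy top-k routed Mixture-of-Experts feed-forward layer."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                          nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (n_tokens, d_model)
        gates = self.router(x).softmax(dim=-1)         # (n_tokens, n_experts)
        weights, idx = gates.topk(self.top_k, dim=-1)  # route each token to top-k experts
        weights = weights / weights.sum(-1, keepdim=True)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            sel = (idx == e).any(dim=-1)               # tokens routed to expert e
            if sel.any():
                w = weights[sel][idx[sel] == e].unsqueeze(-1)
                out[sel] = out[sel] + w * expert(x[sel])
        return out
```

Only the selected experts run for a given token, which is what keeps per-token compute near the "1B active parameters" budget while total capacity is much larger.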
Author:
Groeneveld, Dirk, Beltagy, Iz, Walsh, Pete, Bhagia, Akshita, Kinney, Rodney, Tafjord, Oyvind, Jha, Ananya Harsh, Ivison, Hamish, Magnusson, Ian, Wang, Yizhong, Arora, Shane, Atkinson, David, Authur, Russell, Chandu, Khyathi Raghavi, Cohan, Arman, Dumas, Jennifer, Elazar, Yanai, Gu, Yuling, Hessel, Jack, Khot, Tushar, Merrill, William, Morrison, Jacob, Muennighoff, Niklas, Naik, Aakanksha, Nam, Crystal, Peters, Matthew E., Pyatkin, Valentina, Ravichander, Abhilasha, Schwenk, Dustin, Shah, Saurabh, Smith, Will, Strubell, Emma, Subramani, Nishant, Wortsman, Mitchell, Dasigi, Pradeep, Lambert, Nathan, Richardson, Kyle, Zettlemoyer, Luke, Dodge, Jesse, Lo, Kyle, Soldaini, Luca, Smith, Noah A., Hajishirzi, Hannaneh
Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details…
External link:
http://arxiv.org/abs/2402.00838
Author:
Soldaini, Luca, Kinney, Rodney, Bhagia, Akshita, Schwenk, Dustin, Atkinson, David, Authur, Russell, Bogin, Ben, Chandu, Khyathi, Dumas, Jennifer, Elazar, Yanai, Hofmann, Valentin, Jha, Ananya Harsh, Kumar, Sachin, Lucy, Li, Lyu, Xinxi, Lambert, Nathan, Magnusson, Ian, Morrison, Jacob, Muennighoff, Niklas, Naik, Aakanksha, Nam, Crystal, Peters, Matthew E., Ravichander, Abhilasha, Richardson, Kyle, Shen, Zejiang, Strubell, Emma, Subramani, Nishant, Tafjord, Oyvind, Walsh, Pete, Zettlemoyer, Luke, Smith, Noah A., Hajishirzi, Hannaneh, Beltagy, Iz, Groeneveld, Dirk, Dodge, Jesse, Lo, Kyle
Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to…
External link:
http://arxiv.org/abs/2402.00159
Author:
Magnusson, Ian, Bhagia, Akshita, Hofmann, Valentin, Soldaini, Luca, Jha, Ananya Harsh, Tafjord, Oyvind, Schwenk, Dustin, Walsh, Evan Pete, Elazar, Yanai, Lo, Kyle, Groeneveld, Dirk, Beltagy, Iz, Hajishirzi, Hannaneh, Smith, Noah A., Richardson, Kyle, Dodge, Jesse
Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains – varying distributions of language. Rather than assuming perplexity on one distribution… (A per-domain perplexity sketch follows the link below.)
External link:
http://arxiv.org/abs/2312.10523
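As a concrete reading of "perplexity per domain": perplexity is exp(mean per-token negative log-likelihood), computed separately within each domain rather than over one monolithic held-out set. A small illustrative helper (not Paloma's code; the input format is an assumption):

```python
import math
from collections import defaultdict

def perplexity_by_domain(token_nlls):
    """token_nlls: iterable of (domain, nll) pairs, one per token."""
    totals = defaultdict(lambda: [0.0, 0])
    for domain, nll in token_nlls:
        totals[domain][0] += nll  # sum of negative log-likelihoods
        totals[domain][1] += 1    # token count
    return {d: math.exp(s / n) for d, (s, n) in totals.items()}

# Example: perplexity_by_domain([("web", 3.2), ("web", 2.9), ("code", 1.4)])
# -> {"web": ~21.1, "code": ~4.06}
```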
Author:
Groeneveld, Dirk, Awadalla, Anas, Beltagy, Iz, Bhagia, Akshita, Magnusson, Ian, Peng, Hao, Tafjord, Oyvind, Walsh, Pete, Richardson, Kyle, Dodge, Jesse
The success of large language models has shifted the evaluation paradigms in natural language processing (NLP). The community's interest has drifted towards comparing NLP models across many tasks, domains, and datasets, often at an extreme scale. This… (A toy sketch of a unified evaluation interface follows the link below.)
External link:
http://arxiv.org/abs/2312.10253
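One way to see why a unified interface helps at this scale: evaluating m models on n tasks naively needs m × n bespoke harnesses, while a shared interface needs only m + n adapters. A toy sketch of that idea (hypothetical names; not Catwalk's actual API):

```python
from dataclasses import dataclass

@dataclass
class Example:
    prompt: str
    answer: str

def evaluate(predict, examples):
    """Accuracy of one model (a predict(prompt) -> answer callable) on one task."""
    return sum(predict(ex.prompt) == ex.answer for ex in examples) / len(examples)

def evaluate_all(models, tasks):
    """models: {name: predict_fn}; tasks: {name: [Example, ...]}."""
    return {(m, t): evaluate(fn, exs)
            for m, fn in models.items() for t, exs in tasks.items()}
```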
Author:
Elazar, Yanai, Bhagia, Akshita, Magnusson, Ian, Ravichander, Abhilasha, Schwenk, Dustin, Suhr, Alane, Walsh, Pete, Groeneveld, Dirk, Soldaini, Luca, Singh, Sameer, Hajishirzi, Hanna, Smith, Noah A., Dodge, Jesse
Large text corpora are the backbone of language models. However, we have a limited understanding of the content of these corpora, including general statistics, quality, social factors, and inclusion of evaluation data (contamination). In this work, we… (Illustrative corpus-statistics and contamination checks follow the link below.)
External link:
http://arxiv.org/abs/2310.20707
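Two of the analyses the abstract names, general statistics and contamination, reduce conceptually to passes over the corpus. An illustrative, deliberately non-scalable sketch (WIMBD's real tooling and function names differ):

```python
def corpus_stats(documents):
    """Basic statistics over a list of document strings."""
    lengths = [len(doc.split()) for doc in documents]
    return {"docs": len(documents),
            "whitespace_tokens": sum(lengths),
            "duplicate_docs": len(documents) - len(set(documents))}

def contamination_rate(documents, eval_examples):
    """Fraction of evaluation examples appearing verbatim in some document."""
    hits = sum(any(ex in doc for doc in documents) for ex in eval_examples)
    return hits / len(eval_examples)
```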
Author:
Bhagia, Div
This paper presents a novel approach to distinguish the impact of duration-dependent forces and adverse selection on the exit rate from unemployment by leveraging variation in the length of layoff notices. I formulate a Mixed Hazard model in discrete time… (A standard mixed-hazard decomposition follows the link below.)
External link:
http://arxiv.org/abs/2305.17344
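For context, the identification problem the abstract describes can be written in a standard discrete-time mixed hazard form (a textbook specification assumed here; the paper's exact model may differ). With multiplicative unobserved heterogeneity $\nu$ and structural duration dependence $\psi(t)$,

$$h(t \mid \nu) = \nu\,\psi(t), \qquad \bar{h}(t) = \psi(t)\,\mathbb{E}[\nu \mid T \ge t].$$

Because high-$\nu$ workers exit first, $\mathbb{E}[\nu \mid T \ge t]$ falls with $t$, so the observed hazard $\bar{h}(t)$ conflates true duration dependence with dynamic (adverse) selection; separating the two requires outside variation, here the length of layoff notices.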
Recent NLP models have shown the remarkable ability to effectively generalise 'zero-shot' to new tasks using only natural language instructions as guidance. However, many of these approaches suffer from high computational costs due to their reliance…
External link:
http://arxiv.org/abs/2212.10315
Author:
Wu, Zhaofeng, Logan IV, Robert L., Walsh, Pete, Bhagia, Akshita, Groeneveld, Dirk, Singh, Sameer, Beltagy, Iz
Recently introduced language model prompting methods can achieve high accuracy in zero- and few-shot settings while requiring few to no learned task-specific parameters. Nevertheless, these methods still often trail behind full model finetuning. In this…
External link:
http://arxiv.org/abs/2210.10258
Author:
FitzGerald, Jack, Ananthakrishnan, Shankar, Arkoudas, Konstantine, Bernardi, Davide, Bhagia, Abhishek, Bovi, Claudio Delli, Cao, Jin, Chada, Rakesh, Chauhan, Amit, Chen, Luoxin, Dwarakanath, Anurag, Dwivedi, Satyam, Gojayev, Turan, Gopalakrishnan, Karthik, Gueudre, Thomas, Hakkani-Tur, Dilek, Hamza, Wael, Hueser, Jonathan, Jose, Kevin Martin, Khan, Haidar, Liu, Beiye, Lu, Jianhua, Manzotti, Alessandro, Natarajan, Pradeep, Owczarzak, Karolina, Oz, Gokmen, Palumbo, Enrico, Peris, Charith, Prakash, Chandana Satya, Rawls, Stephen, Rosenbaum, Andy, Shenoy, Anjali, Soltan, Saleh, Sridhar, Mukund Harakere, Tan, Liz, Triefenbach, Fabian, Wei, Pan, Yu, Haiyang, Zheng, Shuai, Tur, Gokhan, Natarajan, Prem
Published in:
Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14-18, 2022, Washington, DC, USA
We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M-170M parameters, and their application to the Natural Language Understanding (NLU)… (A generic distillation-loss sketch follows the link below.)
External link:
http://arxiv.org/abs/2206.07808
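Distillation of the kind described above is commonly implemented by training the small student to match the large teacher's softened output distribution. A generic sketch (the paper's actual distillation objective and hyperparameters are not specified here):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_student = F.log_softmax(student_logits / t, dim=-1)
    # Scale by t^2 so gradient magnitudes stay comparable across temperatures.
    return F.kl_div(log_student, soft_teacher, reduction="batchmean") * t * t
```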