Showing 1 - 10 of 2,206 results for search: '"Ablin, A"'
Author:
Kirchhof, Michael, Thornton, James, Ablin, Pierre, Béthune, Louis, Ndiaye, Eugene, Cuturi, Marco
The increased adoption of diffusion models in text-to-image generation has triggered concerns about their reliability. Such models are now closely scrutinized under the lens of various metrics, notably calibration, fairness, or compute efficiency. …
External link:
http://arxiv.org/abs/2410.06025
The composition of training data mixtures is critical for effectively training large language models (LLMs), as it directly impacts their performance on downstream tasks. Our goal is to identify an optimal data mixture to specialize an LLM …
External link:
http://arxiv.org/abs/2410.02498
Specialist language models (LMs) focus on a specific task or domain, on which they often outperform generalist LMs of the same size. However, the specialist data needed to pretrain these models is only available in limited amounts for most tasks. …
External link:
http://arxiv.org/abs/2410.03735
Author:
Ramapuram, Jason, Danieli, Federico, Dhekane, Eeshan, Weers, Floris, Busbridge, Dan, Ablin, Pierre, Likhomanenko, Tatiana, Digani, Jagrit, Gu, Zijin, Shidani, Amitis, Webb, Russ
Attention is a key part of the transformer architecture. It is a sequence-to-sequence mapping that transforms each sequence element into a weighted sum of values. The weights are typically obtained as the softmax of dot products between keys and queries. … (see the sketch below)
External link:
http://arxiv.org/abs/2409.04431
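The snippet above describes attention as a mapping that turns each sequence element into a softmax-weighted sum of values. Below is a minimal NumPy sketch of plain single-head dot-product attention in that spirit; the shapes, function names, and the 1/sqrt(d) scaling are illustrative assumptions, not the specific attention variant studied in the linked paper.

    import numpy as np

    def softmax(z, axis=-1):
        # Numerically stable softmax along the given axis.
        z = z - z.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def attention(Q, K, V):
        # Each output row is a weighted sum of the rows of V; the weights are
        # the softmax of (scaled) dot products between queries and keys.
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        return softmax(scores, axis=-1) @ V

    # Toy usage: a sequence of 4 tokens with dimension 8.
    rng = np.random.default_rng(0)
    Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
    out = attention(Q, K, V)   # shape (4, 8)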
Momentum-based optimizers are central to a wide range of machine learning applications. These typically rely on an Exponential Moving Average (EMA) of gradients, which exponentially decays the contribution of older gradients. … (see the sketch below)
External link:
http://arxiv.org/abs/2409.03137
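To make the EMA mentioned above concrete, here is a minimal NumPy sketch of a momentum step based on an exponential moving average of gradients; the step size, decay beta, and function name are assumptions for illustration, not the optimizer proposed in the linked paper.

    import numpy as np

    def ema_momentum_step(params, grad, m, lr=0.1, beta=0.9):
        # m is an exponential moving average of past gradients: each step the
        # old average is shrunk by beta, so a gradient seen k steps ago
        # contributes with weight (1 - beta) * beta**k.
        m = beta * m + (1.0 - beta) * grad
        params = params - lr * m
        return params, m

    # Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    x = np.array([1.0, -2.0])
    m = np.zeros_like(x)
    for _ in range(100):
        x, m = ema_momentum_step(x, x, m)
    # x is now close to the minimizer 0.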
Optimization over the set of matrices $X$ that satisfy $X^\top B X = I_p$, referred to as the generalized Stiefel manifold, appears in many applications involving sampled covariance matrices, such as canonical correlation analysis (CCA), independent component analysis, … (see the sketch below)
External link:
http://arxiv.org/abs/2405.01702
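For intuition about the constraint above: a point of the generalized Stiefel manifold is an n x p matrix X with X^T B X = I_p, where B is symmetric positive definite (e.g. a sampled covariance). The NumPy sketch below builds such a point by whitening a random matrix; it only illustrates the feasible set and is not the optimization method of the linked paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 8, 3

    # B: a symmetric positive-definite matrix, here a sampled covariance.
    Z = rng.standard_normal((200, n))
    B = Z.T @ Z / 200

    # Map an arbitrary full-rank n x p matrix A onto the feasible set
    # {X : X^T B X = I_p} via X = A (A^T B A)^{-1/2}.
    A = rng.standard_normal((n, p))
    M = A.T @ B @ A
    w, V = np.linalg.eigh(M)                     # M = V diag(w) V^T, w > 0
    X = A @ (V @ np.diag(w ** -0.5) @ V.T)

    print(np.allclose(X.T @ B @ X, np.eye(p)))   # True: constraint satisfied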
Bilevel optimization aims to optimize an outer objective function that depends on the solution to an inner optimization problem. It is routinely used in Machine Learning, notably for hyperparameter tuning. The conventional method to compute the so-called … (see the sketch below)
External link:
http://arxiv.org/abs/2402.16748
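As a toy instance of the bilevel setting above, the sketch below takes ridge regression as the inner problem, a validation loss as the outer objective, and computes the hypergradient with respect to the regularization strength by implicit differentiation, checked against finite differences. The data, names, and closed-form inner solver are assumptions for illustration, not the method of the linked paper.

    import numpy as np

    rng = np.random.default_rng(0)
    n_tr, n_val, d = 50, 20, 5
    X_tr, y_tr = rng.standard_normal((n_tr, d)), rng.standard_normal(n_tr)
    X_val, y_val = rng.standard_normal((n_val, d)), rng.standard_normal(n_val)

    def inner_solution(lam):
        # Inner problem: ridge regression, available here in closed form.
        A = X_tr.T @ X_tr + lam * np.eye(d)
        return np.linalg.solve(A, X_tr.T @ y_tr)

    def outer_loss(lam):
        # Outer objective: validation loss evaluated at the inner solution.
        r = X_val @ inner_solution(lam) - y_val
        return 0.5 * r @ r / n_val

    def hypergradient(lam):
        # Implicit differentiation: dw*/dlam = -(X^T X + lam I)^{-1} w*.
        A = X_tr.T @ X_tr + lam * np.eye(d)
        w = np.linalg.solve(A, X_tr.T @ y_tr)
        dw = -np.linalg.solve(A, w)
        grad_w = X_val.T @ (X_val @ w - y_val) / n_val
        return grad_w @ dw

    lam, eps = 0.3, 1e-6
    fd = (outer_loss(lam + eps) - outer_loss(lam - eps)) / (2 * eps)
    print(hypergradient(lam), fd)   # the two numbers agree closely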
Beyond minimizing a single training loss, many deep learning estimation pipelines rely on an auxiliary objective to quantify and encourage desirable properties of the model (e.g. performance on another dataset, robustness, agreement with a prior). … (see the sketch below)
External link:
http://arxiv.org/abs/2402.02998
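A minimal sketch of the pattern described above: a main training loss combined with a weighted auxiliary objective (here, agreement with a prior), minimized by plain gradient descent. The particular losses, the weight alpha = 0.1, and the prior are placeholder assumptions, not the scheme studied in the linked paper.

    import numpy as np

    rng = np.random.default_rng(0)
    X, y = rng.standard_normal((30, 4)), rng.standard_normal(30)
    w_prior = np.zeros(4)                 # a prior the model should agree with

    def train_loss(w):
        r = X @ w - y
        return 0.5 * r @ r / len(y)

    def aux_loss(w):
        # Auxiliary objective: squared distance to the prior.
        return 0.5 * np.sum((w - w_prior) ** 2)

    def total_loss(w, alpha=0.1):
        # Training loss plus a weighted auxiliary term.
        return train_loss(w) + alpha * aux_loss(w)

    # Gradient descent on the combined objective (alpha = 0.1).
    w = np.zeros(4)
    for _ in range(200):
        grad = X.T @ (X @ w - y) / len(y) + 0.1 * (w - w_prior)
        w -= 0.1 * grad
    print(total_loss(w))                  # combined objective after training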
Large language models are versatile tools but are not suitable for small inference budgets. Small models have more efficient inference, but their lower capacity means that their performance can be good only if one limits their scope to a specialized …
External link:
http://arxiv.org/abs/2402.01093
Self-attention and masked self-attention are at the heart of Transformers' outstanding success. Still, our mathematical understanding of attention, in particular of its Lipschitz properties, which are key when it comes to analyzing robustness and …
External link:
http://arxiv.org/abs/2312.14820