Showing 1 - 10 of 21 for search: '"Noci, Lorenzo"'
Outlier Features (OFs) are neurons whose activation magnitudes significantly exceed the average over a neural network's (NN) width. They are well known to emerge during standard transformer training and have the undesirable effect of hindering quanti
External link:
http://arxiv.org/abs/2405.19279
Recently, there has been growing evidence that if the width and depth of a neural network are scaled toward the so-called rich feature learning limit ($\mu$P and its depth extension), then some hyperparameters -- such as the learning rate -- exhibit tr
External link:
http://arxiv.org/abs/2402.17457
The multi-modal nature of neural loss landscapes is often considered to be the main driver behind the empirical success of deep ensembles. In this work, we probe this belief by constructing various "connected" ensembles which are restricted to lie in
External link:
http://arxiv.org/abs/2402.03187
Linear mode-connectivity (LMC) (or lack thereof) is one of the intriguing characteristics of neural network loss landscapes. While empirically well established, it unfortunately still lacks a proper theoretical understanding. Even worse, although emp
External link:
http://arxiv.org/abs/2312.09832
The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperpa
External link:
http://arxiv.org/abs/2309.16620
Author:
Noci, Lorenzo, Li, Chuning, Li, Mufan Bill, He, Bobby, Hofmann, Thomas, Maddison, Chris, Roy, Daniel M.
In deep learning theory, the covariance matrix of the representations serves as a proxy to examine the network's trainability. Motivated by the success of Transformers, we study the covariance matrix of a modified Softmax-based attention model with s
External link:
http://arxiv.org/abs/2306.17759
Author:
Anagnostidis, Sotiris, Pavllo, Dario, Biggio, Luca, Noci, Lorenzo, Lucchi, Aurelien, Hofmann, Thomas
Autoregressive Transformers adopted in Large Language Models (LLMs) are hard to scale to long sequences. Despite several works trying to reduce their computational cost, most LLMs still adopt attention layers between all pairs of tokens in the seq
External link:
http://arxiv.org/abs/2305.15805
In contrast to the natural capabilities of humans to learn new tasks in a sequential fashion, neural networks are known to suffer from catastrophic forgetting, where the model's performance on old tasks drops dramatically after being optimized for a
External link:
http://arxiv.org/abs/2303.09483
Despite the empirical advances of deep learning across a variety of learning tasks, our theoretical understanding of its success is still very restricted. One of the key challenges is the overparametrized nature of modern models, enabling complete ov
External link:
http://arxiv.org/abs/2210.14019
Author:
Noci, Lorenzo, Anagnostidis, Sotiris, Biggio, Luca, Orvieto, Antonio, Singh, Sidak Pal, Lucchi, Aurelien
Transformers have achieved remarkable success in several domains, ranging from natural language processing to computer vision. Nevertheless, it has been recently shown that stacking self-attention layers - the distinctive architectural component of T
External link:
http://arxiv.org/abs/2206.03126