Showing 1 - 10 of 354 for the search: "Sra, Suvrit"
We show theoretically and empirically that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems such as electric flow and eigenvector decomposition. The input to the Transformer is simply the graph…
External link:
http://arxiv.org/abs/2410.16699
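
As a point of reference for the entry above, the following is a minimal sketch of a single linear-attention layer, i.e. attention without the softmax, so the output is $Q (K^\top V)$; it is not the paper's construction, and the graph encoding of the input is not reproduced here.

import numpy as np

def linear_attention(X, W_q, W_k, W_v):
    """One linear-attention layer: output = Q (K^T V), with no softmax.

    X is an (n, d) array of token embeddings (e.g., one token per node or
    edge of a graph); W_q, W_k, W_v are (d, d) projection matrices.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    # associativity lets us form K^T V first, so the cost is linear in n
    return Q @ (K.T @ V)

rng = np.random.default_rng(0)
n, d = 8, 4
X = rng.standard_normal((n, d))
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
print(linear_attention(X, W_q, W_k, W_v).shape)  # (8, 4)
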
Author:
Dutta, Sanchayan; Sra, Suvrit
We show that memory-augmented Transformers (Memformers) can implement linear first-order optimization methods such as conjugate gradient descent, momentum methods, and more generally, methods that linearly combine past gradients. Building on prior work…
External link:
http://arxiv.org/abs/2410.07263
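
To make "methods that linearly combine past gradients" concrete, here is a minimal sketch of heavy-ball momentum on a quadratic, one standard member of that family; the Memformer construction itself is not reproduced, and the problem instance below is illustrative.

import numpy as np

def heavy_ball(grad, x0, lr=0.1, beta=0.9, iters=500):
    """Heavy-ball momentum: each step applies a geometrically weighted
    linear combination of all past gradients."""
    x, m = x0.astype(float).copy(), np.zeros_like(x0, dtype=float)
    for _ in range(iters):
        m = beta * m + grad(x)   # running linear combination of past gradients
        x = x - lr * m
    return x

# quadratic f(x) = 0.5 x^T A x - b^T x, whose minimizer solves A x = b
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])
x_star = heavy_ball(lambda x: A @ x - b, np.zeros(2))
print(np.allclose(A @ x_star, b, atol=1e-6))  # True
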
Algorithms for bilevel optimization often encounter Hessian computations, which are prohibitive in high dimensions. While recent works offer first-order methods for unconstrained bilevel problems, the constrained setting remains relatively underexplored…
External link:
http://arxiv.org/abs/2406.12771
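
For context on where the Hessian computations enter: for the bilevel problem $\min_x f(x, y^*(x))$ with $y^*(x) = \arg\min_y g(x, y)$ and a strongly convex lower level, the standard implicit-function hypergradient is

$\nabla F(x) = \nabla_x f(x, y^*) - \nabla^2_{xy} g(x, y^*) \, [\nabla^2_{yy} g(x, y^*)]^{-1} \, \nabla_y f(x, y^*),$

and it is the cross- and lower-level Hessians in this expression that first-order methods aim to avoid forming or inverting.
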
We develop new algorithms for Riemannian bilevel optimization. We focus in particular on batch and stochastic gradient-based methods, with the explicit goal of avoiding second-order information such as Riemannian hyper-gradients. We propose and analyze…
External link:
http://arxiv.org/abs/2405.15816
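
As a minimal illustration of Riemannian gradient-based steps of the kind mentioned above (not the paper's bilevel algorithm), here is Riemannian gradient descent on the unit sphere with a normalization retraction, applied to the toy task of computing a leading eigenvector.

import numpy as np

def sphere_rgd(egrad, x0, lr=0.1, iters=500):
    """Riemannian gradient descent on the unit sphere: project the Euclidean
    gradient onto the tangent space, take a step, then retract by normalizing."""
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        g = egrad(x)
        rg = g - (x @ g) * x        # tangent-space projection at x
        x = x - lr * rg
        x = x / np.linalg.norm(x)   # retraction back onto the sphere
    return x

# leading eigenvector of A by minimizing f(x) = -x^T A x over the sphere
A = np.diag([3.0, 1.0, 0.5])
x = sphere_rgd(lambda x: -2.0 * A @ x, np.ones(3))
print(np.round(np.abs(x), 3))       # approximately [1, 0, 0]
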
We study the task of efficiently sampling from a Gibbs distribution $d\pi^* = e^{-h} \, d\mathrm{vol}_g$ over a Riemannian manifold $M$ via (geometric) Langevin MCMC; this algorithm involves computing exponential maps in random Gaussian directions and is eff…
External link:
http://arxiv.org/abs/2402.10357
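
A minimal sketch of the geometric Langevin step described above, instantiated on the unit sphere, where the exponential map has a closed form; the target $h$ below is a toy choice and the chain is unadjusted.

import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere at x, applied to a tangent vector v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def geometric_langevin(grad_h, x0, step=0.01, iters=1000, seed=0):
    """Unadjusted geometric Langevin MCMC targeting pi ~ exp(-h) on the sphere:
    drift along the negative Riemannian gradient of h plus Gaussian noise in
    the tangent space, then map back to the manifold with the exponential map."""
    rng = np.random.default_rng(seed)
    x = x0 / np.linalg.norm(x0)
    samples = []
    for _ in range(iters):
        g = grad_h(x)
        g = g - (x @ g) * x                      # Riemannian gradient of h
        xi = rng.standard_normal(x.shape)
        xi = xi - (x @ xi) * x                   # Gaussian noise in the tangent space
        x = sphere_exp(x, -step * g + np.sqrt(2.0 * step) * xi)
        samples.append(x)
    return np.array(samples)

# h(x) = -3 x_1, so the Gibbs distribution concentrates around e_1
samples = geometric_langevin(lambda x: np.array([-3.0, 0.0, 0.0]), np.ones(3))
print(np.round(samples[500:, 0].mean(), 2))      # clearly positive: mass near e_1
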
Many neural network architectures are known to be Turing complete and can thus, in principle, implement arbitrary algorithms. However, Transformers are unique in that they can implement gradient-based learning algorithms under simple parameter configurations…
External link:
http://arxiv.org/abs/2312.06528
Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized…
External link:
http://arxiv.org/abs/2310.01082
Invex programs are a special class of non-convex problems that attain global minima at every stationary point. While classical first-order gradient descent methods can solve them, they converge very slowly. In this paper, we propose new first-order algorithms…
External link:
http://arxiv.org/abs/2307.04456
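
To illustrate the defining property quoted above (every stationary point is a global minimum), here is a textbook-style example that is not from the paper: $f(x, y) = x^2 y^2$ is non-convex, yet its stationary points are exactly the points with $x = 0$ or $y = 0$, all of which attain the global minimum value $0$, so plain gradient descent succeeds.

import numpy as np

# f(x, y) = x^2 * y^2 is non-convex, but every stationary point (x = 0 or
# y = 0) is a global minimizer -- the property that defines invexity.
def f(p):
    x, y = p
    return x**2 * y**2

def grad_f(p):
    x, y = p
    return np.array([2.0 * x * y**2, 2.0 * x**2 * y])

p = np.array([1.5, -2.0])
for _ in range(2000):
    p = p - 0.01 * grad_f(p)     # plain (slow) gradient descent
print(round(f(p), 6))            # ~0.0: the global minimum is reached
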
Published in:
37th Conference on Neural Information Processing Systems (NeurIPS 2023)
Several recent works demonstrate that transformers can implement algorithms like gradient descent. By a careful construction of weights, these works show that multiple layers of transformers are expressive enough to simulate iterations of gradient descent…
External link:
http://arxiv.org/abs/2306.00297
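
A minimal sketch of the kind of weight construction referenced above, under the standard token layout for in-context linear regression: one linear self-attention layer whose output at the query token equals the prediction of a single gradient-descent step (started from $w = 0$) on the in-context least-squares loss. The specific matrices below are illustrative, not taken from the paper.

import numpy as np

def linear_self_attention(E, W_K, W_Q, W_V, P):
    """Linear self-attention (no softmax) over a token matrix E of shape
    (token_dim, num_tokens): returns P (W_V E) (W_K E)^T (W_Q E)."""
    K, Q, V = W_K @ E, W_Q @ E, W_V @ E
    return P @ V @ (K.T @ Q)

rng = np.random.default_rng(0)
d, N, eta = 3, 16, 1.0
X = rng.standard_normal((d, N))               # in-context inputs x_1..x_N
w_true = rng.standard_normal(d)
y = w_true @ X                                # in-context labels
x_q = rng.standard_normal(d)                  # query input

# tokens: columns (x_i, y_i) for the context plus a query token (x_q, 0)
E = np.column_stack([np.vstack([X, y[None, :]]), np.append(x_q, 0.0)])

# weight construction: keys/queries read the x-part, values read the y-part
W_KQ = np.zeros((d + 1, d + 1)); W_KQ[:d, :d] = np.eye(d)
W_V = np.zeros((d + 1, d + 1)); W_V[d, d] = 1.0
P = (eta / N) * np.eye(d + 1)

out = linear_self_attention(E, W_KQ, W_KQ, W_V, P)
attn_pred = out[d, -1]                        # y-slot of the query token

# one explicit GD step on 1/(2N) * sum_i (w^T x_i - y_i)^2, starting at w = 0
w_step = (eta / N) * X @ y
print(np.isclose(attn_pred, w_step @ x_q))    # True
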
Modern machine learning applications have witnessed the remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this design choice, we undertake a formal study that (i) formulates the notion of flat minima, a…
External link:
http://arxiv.org/abs/2305.15659
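
The paper's precise formulation of flat minima is not reproduced here; as a rough companion, this sketch estimates one common flatness proxy, the trace of the loss Hessian, with Hutchinson probes and finite-difference Hessian-vector products. The toy quadratic loss is an assumption for illustration.

import numpy as np

def hessian_trace_estimate(grad, x, n_probes=100, eps=1e-4, seed=0):
    """Hutchinson estimator for tr(Hessian) at x, a common proxy for how
    sharp (non-flat) a minimum is; Hessian-vector products are approximated
    by central finite differences of the gradient."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_probes):
        v = rng.choice([-1.0, 1.0], size=x.shape)              # Rademacher probe
        hv = (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)
        total += v @ hv
    return total / n_probes

# toy quadratic loss with Hessian diag(10, 1): one sharp and one flat direction
grad = lambda x: np.array([10.0, 1.0]) * x
print(round(hessian_trace_estimate(grad, np.zeros(2)), 2))     # ~11.0
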