Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers

Author:	Musat, Tiberiu
Publication year:	2024
Subject:	Computer Science - Machine Learning; Computer Science - Computation and Language
Document type:	Working Paper
Description:	In this paper, I introduce the retrieval problem, a simple reasoning task that can be solved only by transformers with a minimum number of layers. The task has an adjustable difficulty that can raise the required number of layers to any arbitrary value. I demonstrate that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. I find that successful learning occurs only in the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process and find that attention heads always emerge in a specific sequence.
Database:	arXiv
External link:	http://arxiv.org/abs/2411.12118