Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention
Author: | Zhang, B., Titov, I., Sennrich, R. (proceedings editors: Inui, K., Jiang, J., Ng, V., Wan, X.) |
---|---|
Contributors: | University of Zurich, ILLC (FNWI), Faculty of Science, Brain and Cognition, Language and Computation (ILLC, FNWI/FGw) |
Language: | English |
Year of publication: | 2019 |
Subject: |
FOS: Computer and information sciences; Computation and Language (cs.CL); Machine translation; Transformer (machine learning model); Artificial neural network; Initialization; Normalization (statistics); Residual connections; Decoding methods; Linguistics; Institute of Computational Linguistics |
Source: | Zhang, B., Titov, I. & Sennrich, R. 2019, 'Improving Deep Transformer with Depth-Scaled Initialization and Merged Attention', in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP 2019), Hong Kong, China, 3-7 November 2019, pp. 898-909. https://doi.org/10.18653/v1/D19-1083 |
DOI: | 10.18653/v1/D19-1083 |
Description: | The general trend in NLP is towards increasing model capacity and performance via deeper neural networks. However, simply stacking more layers of the popular Transformer architecture for machine translation results in poor convergence and high computational overhead. Our empirical analysis suggests that convergence is poor due to gradient vanishing caused by the interaction between residual connections and layer normalization. We propose depth-scaled initialization (DS-Init), which decreases parameter variance at the initialization stage and reduces the output variance of residual connections, so as to ease gradient back-propagation through normalization layers. To address computational cost, we propose a merged attention sublayer (MAtt), which combines a simplified average-based self-attention sublayer and the encoder-decoder attention sublayer on the decoder side. Results on WMT and IWSLT translation tasks with five translation directions show that deep Transformers with DS-Init and MAtt can substantially outperform their base counterpart in terms of BLEU (+1.1 BLEU on average for 12-layer models), while matching the decoding speed of the baseline model thanks to the efficiency improvements of MAtt. Published at EMNLP 2019. A minimal code sketch of the DS-Init idea follows this record. |
Database: | OpenAIRE |
External link: |
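The description above explains that DS-Init lowers parameter variance at initialization in proportion to layer depth so that gradients pass more easily through layer normalization. Below is a minimal PyTorch sketch of that idea, assuming a Xavier-style uniform initialization whose bound is divided by the square root of the layer's (1-based) depth; the helper name `ds_init_` and the 12-layer loop are illustrative and not taken from the authors' released code.

```python
import math
import torch
import torch.nn as nn

def ds_init_(weight: torch.Tensor, layer_depth: int) -> torch.Tensor:
    """Depth-scaled initialization sketch: Xavier-uniform bound shrunk by
    1/sqrt(depth), so deeper layers start with smaller parameter variance."""
    # For a 2-D weight of shape (out_features, in_features):
    fan_out, fan_in = weight.shape
    bound = math.sqrt(6.0 / (fan_in + fan_out))   # standard Xavier-uniform bound
    bound /= math.sqrt(layer_depth)               # depth scaling (assumed 1/sqrt(l))
    with torch.no_grad():
        return weight.uniform_(-bound, bound)

# Illustrative use: initialize the projections of a 12-layer stack by their depth.
layers = [nn.Linear(512, 512) for _ in range(12)]
for depth, layer in enumerate(layers, start=1):
    ds_init_(layer.weight, depth)
    nn.init.zeros_(layer.bias)
```

Per the abstract, the intended effect is to shrink the output variance of residual branches so that back-propagation through the normalization layers of a deep stack does not suffer from vanishing gradients.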