Self-training Improves Pre-training for Natural Language Understanding
Authors: Edouard Grave, Veselin Stoyanov, Vishrav Chaudhary, Beliz Gunel, Jingfei Du, Onur Celebi, Michael Auli, Alexis Conneau
Year: 2020
Subjects: FOS: Computer and information sciences; Computer Science - Computation and Language (cs.CL); natural language understanding; machine learning; artificial intelligence; self-training; labeled data; scalability
Source: NAACL-HLT
DOI: 10.48550/arxiv.2010.02194
Description: Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge distillation and few-shot learning. Comment: 8 pages
Database: OpenAIRE
External link:
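The description above outlines SentAugment's key retrieval step: embed the labeled task data, form a task-specific query embedding, and pull the nearest sentences from a large unlabeled bank. Below is a minimal illustrative sketch of that idea in Python with NumPy; it is not the paper's released code. The function names, the use of a single mean-of-labeled-embeddings query, and the toy data are all assumptions for demonstration.

```python
# Minimal sketch of SentAugment-style retrieval (illustrative, not the
# paper's implementation). Assumes fixed-size sentence embeddings already
# exist for both the labeled task data and the unlabeled sentence bank.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize rows so that dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(labeled_emb: np.ndarray, bank_emb: np.ndarray, k: int) -> np.ndarray:
    """Average the labeled-sentence embeddings into one task-level query
    (one of several possible query constructions) and return the indices
    of the k most cosine-similar sentences in the bank."""
    query = l2_normalize(labeled_emb.mean(axis=0, keepdims=True))  # shape (1, d)
    scores = l2_normalize(bank_emb) @ query.T                      # shape (N, 1)
    return np.argsort(-scores.ravel())[:k]

# Toy usage with random embeddings: 100 labeled sentences, a bank of
# 10,000 candidates, embedding dimension 128.
rng = np.random.default_rng(0)
labeled_emb = rng.normal(size=(100, 128))
bank_emb = rng.normal(size=(10_000, 128))
top_idx = retrieve(labeled_emb, bank_emb, k=1000)
print(top_idx[:5])
```

In the full pipeline the abstract describes, the retrieved sentences would then be pseudo-labeled by a teacher model fine-tuned on the labeled data, and the resulting synthetic data used to train a student model (self-training) or a smaller distilled model.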