Self-training Improves Pre-training for Natural Language Understanding
Authors: Edouard Grave, Veselin Stoyanov, Vishrav Chaudhary, Beliz Gunel, Jingfei Du, Onur Celebi, Michael Auli, Alexis Conneau
Year: 2020
Subjects: FOS: Computer and information sciences; Computer Science - Computation and Language (cs.CL); natural language understanding; machine learning; artificial intelligence; self-training; labeled data; scalability
Source: NAACL-HLT
DOI: 10.48550/arxiv.2010.02194
Description: Unsupervised pre-training has led to much recent progress in natural language understanding. In this paper, we study self-training as another way to leverage unlabeled data through semi-supervised learning. To obtain additional data for a specific task, we introduce SentAugment, a data augmentation method which computes task-specific query embeddings from labeled data to retrieve sentences from a bank of billions of unlabeled sentences crawled from the web. Unlike previous semi-supervised methods, our approach does not require in-domain unlabeled data and is therefore more generally applicable. Experiments show that self-training is complementary to strong RoBERTa baselines on a variety of tasks. Our augmentation approach leads to scalable and effective self-training with improvements of up to 2.6% on standard text classification benchmarks. Finally, we also show strong gains on knowledge distillation and few-shot learning. Comment: 8 pages
Database: OpenAIRE
External link:
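The description above outlines SentAugment's key retrieval step: embed the labeled task data, form a task-specific query embedding, and pull the nearest sentences from a large unlabeled bank. Below is a minimal illustrative sketch of that idea in Python with NumPy; it is not the paper's released code. The function names, the use of a single mean-of-labeled-embeddings query, and the toy data are all assumptions for demonstration.

```python
# Minimal sketch of SentAugment-style retrieval (illustrative, not the
# paper's implementation). Assumes fixed-size sentence embeddings already
# exist for both the labeled task data and the unlabeled sentence bank.
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """L2-normalize rows so that dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(labeled_emb: np.ndarray, bank_emb: np.ndarray, k: int) -> np.ndarray:
    """Average the labeled-sentence embeddings into one task-level query
    (one of several possible query constructions) and return the indices
    of the k most cosine-similar sentences in the bank."""
    query = l2_normalize(labeled_emb.mean(axis=0, keepdims=True))  # shape (1, d)
    scores = l2_normalize(bank_emb) @ query.T                      # shape (N, 1)
    return np.argsort(-scores.ravel())[:k]

# Toy usage with random embeddings: 100 labeled sentences, a bank of
# 10,000 candidates, embedding dimension 128.
rng = np.random.default_rng(0)
labeled_emb = rng.normal(size=(100, 128))
bank_emb = rng.normal(size=(10_000, 128))
top_idx = retrieve(labeled_emb, bank_emb, k=1000)
print(top_idx[:5])
```

In the full pipeline the abstract describes, the retrieved sentences would then be pseudo-labeled by a teacher model fine-tuned on the labeled data, and the resulting synthetic data used to train a student model (self-training) or a smaller distilled model.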