Large-scale, diverse, paraphrastic bitexts via sampling and clustering

Author:	Matt Post, Nils Holzenberger, Benjamin Van Durme, Abhinav Singh, J. Edward Hu
Subject:	Computer science; business.industry; Inference; Sampling (statistics); 02 engineering and technology; computer.software_genre; Paraphrase; Task (project management); Resource (project management); 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; Beam search; 020201 artificial intelligence & image processing; Artificial intelligence; business; Cluster analysis; computer; Sentence; Natural language processing
Source:	Scopus-Elsevier CoNLL
Description:	Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe ParaBank 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that ParaBank 2 significantly surpasses prior work in both lexical and syntactic diversity while being meaning-preserving, as measured by human judgments and standardized metrics. Further, we illustrate how such paraphrastic resources may be used to refine contextualized encoders, leading to improvements in downstream tasks.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::a9569a1c38c4f8a4a079bb9547768ca8 http://www.scopus.com/inward/record.url?eid=2-s2.0-85084331385&partnerID=MN8TOARS