A survey on training and evaluation of word embeddings

Authors:	Guillaume Gravier, Nihel Kooli, Robin Allesiardo, François Torregrossa, Vincent Claveau
Contributors:	Solocal; Creating and exploiting explicit links between multimedia fragments (LinkMedia); Inria Rennes – Bretagne Atlantique; Institut National de Recherche en Informatique et en Automatique (Inria); MEDIA ET INTERACTIONS (IRISA-D6); Institut de Recherche en Informatique et Systèmes Aléatoires (IRISA); Université de Rennes 1 (UR1); Université de Rennes (UNIV-RENNES); Institut National des Sciences Appliquées - Rennes (INSA Rennes); Institut National des Sciences Appliquées (INSA); Université de Bretagne Sud (UBS); École normale supérieure - Rennes (ENS Rennes); CentraleSupélec; Centre National de la Recherche Scientifique (CNRS); IMT Atlantique Bretagne-Pays de la Loire (IMT Atlantique); Institut Mines-Télécom [Paris] (IMT)
Language:	English
Publication year:	2021
Subject:	0301 basic medicine; Computer science; Contextualised Embeddings; computer.software_genre; Terminology; [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI]; 03 medical and health sciences; 0302 clinical medicine; Evaluation methods; Segmentation; [INFO]Computer Science [cs]; Survey; business.industry; Applied Mathematics; Word Embeddings; Word Embedding Evaluation; Computer Science Applications; [INFO.INFO-TT]Computer Science [cs]/Document and Text Processing; Management information systems; 030104 developmental biology; Computational Theory and Mathematics; Non-Euclidean Embeddings; 030220 oncology & carcinogenesis; Modeling and Simulation; Artificial intelligence; State (computer science); business; Focus (optics); computer; Natural language processing; Word (computer architecture); Strengths and weaknesses; Information Systems
Source:	International Journal of Data Science and Analytics, Springer Verlag, 2021, 11 (2), pp.85-103. ⟨10.1007/s41060-021-00242-8⟩
ISSN:	2364-415X
DOI:	10.1007/s41060-021-00242-8
Description:	Word embeddings have proven effective for many Natural Language Processing tasks by providing word representations that integrate prior knowledge. In this article, we focus on the algorithms and models used to compute those representations and on the methods used to evaluate them. Many new techniques were developed in a short amount of time, and there is no unified terminology to emphasise the strengths and weaknesses of those methods. Based on the state of the art, we propose a thorough terminology to help with the classification of these various models and their evaluations. We also provide comparisons of those algorithms and methods, highlighting open problems and research paths, as well as a compilation of popular evaluation metrics and datasets. This survey gives: 1) an exhaustive description and terminology of currently investigated word embeddings, 2) a clear segmentation of evaluation methods and their associated datasets, and 3) high-level properties to indicate pros and cons of each solution.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::7301a886ce62d3f070714a5e1991cf1b https://hal.archives-ouvertes.fr/hal-03148517/document