Assessment of Pre-Trained Models Across Languages and Grammars

Authors: Muñoz-Ortiz, Alberto, Vilares, David, Gómez-Rodríguez, Carlos
Publication year: 2023
Document type: Working Paper
Description: We present an approach for assessing how multilingual large language models (LLMs) learn syntax in terms of multi-formalism syntactic structures. We aim to recover constituent and dependency structures by casting parsing as sequence labeling. To do so, we select a few LLMs and study them on 13 diverse UD treebanks for dependency parsing and 10 treebanks for constituent parsing. Our results show that: (i) the framework is consistent across encodings, (ii) pre-trained word vectors do not favor constituency representations of syntax over dependencies, (iii) sub-word tokenization is needed to represent syntax, in contrast to character-based models, and (iv) the occurrence of a language in the pretraining data matters more than the amount of task data when recovering syntax from the word vectors.
Comment: Accepted at IJCNLP-AACL 2023
Database: arXiv
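
The description above casts parsing as sequence labeling. As a minimal sketch of what that reduction looks like, the snippet below encodes a dependency tree as one label per word using a relative head-offset encoding, a common scheme in the sequence-labeling parsing literature; the abstract does not name the paper's exact encodings, so the function names, the offset scheme, and the clipping rule here are illustrative assumptions, not the authors' implementation.

from typing import List, Tuple

Label = Tuple[int, str]  # (head offset relative to the word, dependency relation)

def encode(heads: List[int], deprels: List[str]) -> List[Label]:
    # Turn a dependency tree (1-indexed heads, 0 = root) into one label per word.
    return [(h - i, rel) for i, (h, rel) in enumerate(zip(heads, deprels), start=1)]

def decode(labels: List[Label]) -> Tuple[List[int], List[str]]:
    # Recover heads and relations from predicted labels, clipping
    # ill-formed predictions (out-of-range heads) to the root.
    heads, rels = [], []
    for i, (offset, rel) in enumerate(labels, start=1):
        h = i + offset
        heads.append(h if 0 <= h <= len(labels) else 0)
        rels.append(rel)
    return heads, rels

# "She reads books": 'reads' is the root; 'She' and 'books' attach to it.
labels = encode([2, 0, 2], ["nsubj", "root", "obj"])
assert labels == [(1, "nsubj"), (-2, "root"), (-1, "obj")]
assert decode(labels) == ([2, 0, 2], ["nsubj", "root", "obj"])

Once trees are flattened into per-word labels like these, any off-the-shelf sequence labeler over pre-trained word vectors can be trained and evaluated, which is what lets the framework compare encodings and formalisms on equal footing.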