MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Author:	Bañón, Marta; Esplà-Gomis, Miquel; Forcada, Mikel L.; García-Romero, Cristian; Kuzman, Taja; Ljubešić, Nikola; van Noord, Rik; Sempere, Leopoldo Pla; Ramírez-Sánchez, Gema; Rupnik, Peter; Suchomel, Vít; Toral, Antonio; van der Werff, Tobias; Zaragoza, Jaume; Macken, Lieve; Rufener, Andrew; Van den Bogaert, Joachim; Daems, Joke; Tezcan, Arda; Vanroy, Bram; Fonteyne, Margot; Barrault, Loic; Costa-Jussa, Marta R.; Kemp, Ellie; Pilos, Spyridon; Declercq, Christophe; Koponen, Maarit; Scarton, Carolina; Moniz, Helena
Contributors:	Computational Linguistics (CL)
Language:	English
Year of publication:	2022
Source:	EAMT 2022-Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, 303-304
Description:	We introduce MaCoCu (Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages), a project funded by the Connecting Europe Facility that aims to build monolingual and parallel corpora for under-resourced European languages. The approach consists of crawling large amounts of textual data from selected top-level domains of the Internet and then applying a curation and enrichment pipeline. In addition to the corpora, the project will release the free/open-source web crawling and curation software it uses.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=narcis______::5a0a68e6c4f890c520eb13d3be112a97 https://research.rug.nl/en/publications/685514a8-947e-44f9-83cf-90356c5f1684