MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages

Authors: Bañón, Marta; Esplà-Gomis, Miquel; Forcada, Mikel L.; García-Romero, Cristian; Kuzman, Taja; Ljubešić, Nikola; van Noord, Rik; Sempere, Leopoldo Pla; Ramírez-Sánchez, Gema; Rupnik, Peter; Suchomel, Vít; Toral, Antonio; van der Werff, Tobias; Zaragoza, Jaume; Macken, Lieve; Rufener, Andrew; Van den Bogaert, Joachim; Daems, Joke; Tezcan, Arda; Vanroy, Bram; Fonteyne, Margot; Barrault, Loic; Costa-Jussa, Marta R.; Kemp, Ellie; Pilos, Spyridon; Declercq, Christophe; Koponen, Maarit; Scarton, Carolina; Moniz, Helena
Contributors: Computational Linguistics (CL)
Language: English
Publication year: 2022
Source: EAMT 2022: Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pp. 303-304
Description: We introduce MaCoCu (Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages), a project funded by the Connecting Europe Facility that aims to build monolingual and parallel corpora for under-resourced European languages. The approach consists of crawling large amounts of textual data from selected Internet top-level domains and then applying a curation and enrichment pipeline. In addition to the corpora, the project will release the free/open-source web crawling and curation software it uses.
Database: OpenAIRE