MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages
Authors: | Bañón, Marta, Esplà-Gomis, Miquel, Forcada, Mikel L., García-Romero, Cristian, Kuzman, Taja, Ljubešić, Nikola, van Noord, Rik, Sempere, Leopoldo Pla, Ramírez-Sánchez, Gema, Rupnik, Peter, Suchomel, Vít, Toral, Antonio, van der Werff, Tobias, Zaragoza, Jaume, Macken, Lieve, Rufener, Andrew, Van den Bogaert, Joachim, Daems, Joke, Tezcan, Arda, Vanroy, Bram, Fonteyne, Margot, Barrault, Loic, Costa-Jussa, Marta R., Kemp, Ellie, Pilos, Spyridon, Declercq, Christophe, Koponen, Maarit, Scarton, Carolina, Moniz, Helena |
---|---|
Contributors: | Computational Linguistics (CL) |
Language: | English |
Year of publication: | 2022 |
Source: | EAMT 2022-Proceedings of the 23rd Annual Conference of the European Association for Machine Translation, pp. 303-304 |
Description: | We introduce the project MaCoCu: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages, funded by the Connecting Europe Facility, which aims to build monolingual and parallel corpora for under-resourced European languages. The approach consists of crawling large amounts of textual data from selected top-level domains of the Internet and then applying a curation and enrichment pipeline. In addition to the corpora, the project will release the free/open-source web crawling and curation software used. |
Database: | OpenAIRE |