The MARCELL Legislative Corpus

Author:	Váradi, Tamás, Koeva, Svetla, Yamalov, Martin, Tadić, Marko, Sass, Bálint, Nitoń, Bartłomiej, Ogrodniczuk, Maciej, Pęzik, Piotr, Barbu Mititelu, Verginica, Ion, Radu, Irimia, Elena, Mitrofan, Maria, Păiș, Vasile, Tufiș, Dan, Garabík, Radovan, Krek, Simon, Repar, Andraz, Rihtar, Matjaž, Brank, Janez
Contributors:	Calzolari, Nicoletta, Béchet, Frédéric, Blache, Philippe, Choukri, Khalid, Cieri, Christopher, Declerck, Thierry, Goggi, Sara, Isahara, Hitoshi, Maegaard, Bente, Mariani, Joseph, Mazo, Hélène, Moreno, Asuncion, Odijk, Jan, Piperidis, Stelios
Language:	English
Year of publication:	2020
Subject:	law corpus comparable corpus under-resourced languages
Source:	Scopus-Elsevier Web of Science Radovan Garabík
Description:	This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are provided uniformly for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency and/or noun phrase annotation, the corpus is enriched with IATE and EuroVoc labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=dedup_wf_001::57e412f1dfc2180d9803f8790993399b https://www.bib.irb.hr/1062869
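
The description above mentions a CoNLL-U Plus file layout with ten standard CoNLL-U columns and four corpus-specific ones. The following is a minimal sketch of how such a file might be read, not the project's own tooling. It relies on the CoNLL-U Plus convention of declaring the column order in a `# global.columns = ...` header line. The four `MARCELL:*` column names, the function name `read_conllu_plus`, and every value in the sample are assumptions made for illustration; the record itself does not name the extra columns.

```python
from typing import Dict, List

# Hypothetical sample: the four MARCELL:* column names and all token values
# below are illustrative assumptions, not taken from the actual corpus.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "MARCELL:NE MARCELL:NP MARCELL:IATE MARCELL:EUROVOC\n"
    "# sent_id = demo-1\n"
    "1\tEuropean\tEuropean\tADJ\t_\t_\t2\tamod\t_\t_\tB-ORG\t1:NP\t_\t_\n"
    "2\tCommission\tCommission\tNOUN\t_\t_\t0\troot\t_\t_\tI-ORG\t1:NP\t_\t_\n"
    "\n"
)


def read_conllu_plus(text: str) -> List[List[Dict[str, str]]]:
    """Parse CoNLL-U Plus text into sentences of {column name: value} dicts."""
    columns: List[str] = []
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if line.startswith("# global.columns ="):
            # The header declares the column order for the whole file.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (e.g. sentence ids) are skipped
        elif not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
        else:
            values = line.split("\t")
            if len(values) != len(columns):
                raise ValueError(
                    f"expected {len(columns)} fields, got {len(values)}"
                )
            current.append(dict(zip(columns, values)))
    if current:  # the file may end without a trailing blank line
        sentences.append(current)
    return sentences


sentences = read_conllu_plus(SAMPLE)
```

Reading the column names from the header rather than hard-coding them keeps the sketch usable even if the actual extra columns are named or ordered differently from the assumption made here.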