The MARCELL Legislative Corpus

Authors: Váradi, Tamás; Koeva, Svetla; Yamalov, Martin; Tadić, Marko; Sass, Bálint; Nitoń, Bartłomiej; Ogrodniczuk, Maciej; Pęzik, Piotr; Barbu Mititelu, Verginica; Ion, Radu; Irimia, Elena; Mitrofan, Maria; Păiș, Vasile; Tufiș, Dan; Garabík, Radovan; Krek, Simon; Repar, Andraz; Rihtar, Matjaž; Brank, Janez
Contributors: Calzolari, Nicoletta; Béchet, Frédéric; Blache, Philippe; Choukri, Khalid; Cieri, Christopher; Declerck, Thierry; Goggi, Sara; Isahara, Hitoshi; Maegaard, Bente; Mariani, Joseph; Mazo, Hélène; Moreno, Asuncion; Odijk, Jan; Piperidis, Stelios
Language: English
Year of publication: 2020
Subject:
Source: Scopus-Elsevier
Web of Science
Radovan Garabík
Description: This article presents the current outcomes of the MARCELL CEF Telecom project, which aims to collect and deeply annotate a large comparable corpus of legal documents. The MARCELL corpus includes 7 monolingual sub-corpora (Bulgarian, Croatian, Hungarian, Polish, Romanian, Slovak and Slovenian) containing the total body of the respective national legislative documents. These sub-corpora are automatically sentence-split, tokenized, lemmatized, and morphologically and syntactically annotated. The monolingual sub-corpora are complemented by a thematically related parallel corpus (Croatian-English). The metadata and the annotations are provided uniformly for each language-specific sub-corpus. Besides the standard morphosyntactic analysis plus named entity and dependency and/or noun phrase annotation, the corpus is enriched with IATE and EuroVoc labels. The file format is CoNLL-U Plus, containing the ten columns specific to the CoNLL-U format and four extra columns specific to our corpora. The MARCELL corpora represent a rich and valuable source for further studies and developments in machine learning, cross-lingual terminological data extraction and classification.
Database: OpenAIRE
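
The description mentions a CoNLL-U Plus file with ten standard columns and four corpus-specific ones. A minimal sketch of reading such a file in Python might look like the following. It relies only on the general CoNLL-U Plus convention that a `# global.columns = ...` comment names the columns. The four extra column names (`MARCELL:NE`, `MARCELL:NP`, `MARCELL:IATE`, `MARCELL:EUROVOC`), the helper `read_conllu_plus`, and the sample sentence with its annotation values are assumptions made for illustration and are not taken from this record.

```python
from typing import Dict, Iterable, Iterator, List

# The ten standard CoNLL-U columns, used when no header line is present.
STANDARD_COLUMNS = "ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC".split()


def read_conllu_plus(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield sentences as lists of {column name: value} dicts (hypothetical helper)."""
    columns = list(STANDARD_COLUMNS)
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("# global.columns ="):
            # CoNLL-U Plus declares its column layout in this comment line.
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other comment lines (sentence ids, raw text) are skipped
        elif not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Invented sample: column names and annotation values are assumptions.
SAMPLE = (
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC "
    "MARCELL:NE MARCELL:NP MARCELL:IATE MARCELL:EUROVOC\n"
    "# text = Parliament adopts laws\n"
    "1\tParliament\tparliament\tNOUN\t_\t_\t2\tnsubj\t_\t_\tB-ORG\t1:NP\t_\t_\n"
    "2\tadopts\tadopt\tVERB\t_\t_\t0\troot\t_\t_\tO\t_\t_\t_\n"
    "3\tlaws\tlaw\tNOUN\t_\t_\t2\tobj\t_\t_\tO\t2:NP\t_\t_\n"
    "\n"
)

sentences = list(read_conllu_plus(SAMPLE.splitlines()))
for token in sentences[0]:
    print(token["FORM"], token["UPOS"], token["MARCELL:NE"])
```

Reading the column names from the header rather than hard-coding fourteen positions keeps the sketch usable if a sub-corpus orders or names its extra columns differently.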