Building Multilingual Corpora for a Complex Named Entity Recognition and Classification Hierarchy using Wikipedia and DBpedia

Authors:	Alves, D., Thakkar, G., Amaral, G., Kuculo, T., Tadić, M.
Contributors:	Paschke, Adrian et al.
Language:	English
Year of publication:	2022
Subject:	FOS: Computer and information sciences; Computer Science - Computation and Language; InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL; named-entity; multilingualism; data extraction; Computation and Language (cs.CL)
Source:	Scopus-Elsevier
Description:	With the ever-growing popularity of NLP, the demand for datasets in low-resource languages has grown as well. Following a previously established framework, we present the UNER dataset, a multilingual, hierarchical parallel corpus annotated for named entities. We describe in detail the procedure developed to create this type of dataset in any language available on Wikipedia with DBpedia information. The three-step procedure extracts entities from Wikipedia articles, links them to DBpedia, and maps the DBpedia sets of classes to the UNER labels. This is followed by a post-processing procedure that significantly increases the number of identified entities in the final results. The paper concludes with a statistical and qualitative analysis of the resulting dataset. arXiv admin note: substantial text overlap with arXiv:2212.07162
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::a6b6777d0bfb1de5304a74d82744a243 http://arxiv.org/abs/2212.07429
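
The three-step procedure summarized in the description (extract entities from Wikipedia articles, link them to DBpedia, map DBpedia classes to UNER labels) can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes that entity mentions are taken from wiki links in the article markup, that DBpedia types are available as an in-memory dictionary keyed by article title, and that the class-to-label mapping is an ordered list of rules. All function names, the `dbo:` class sets, and the `LABEL_*` names are hypothetical placeholders, not actual UNER labels.

```python
import re

# Matches wiki links of the form [[Title]] or [[Title|surface text]].
WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]")


def extract_entities(wikitext):
    """Step 1 (assumed): collect (surface form, article title) pairs from wiki links."""
    pairs = []
    for match in WIKILINK.finditer(wikitext):
        title = match.group(1).strip()
        surface = (match.group(2) or title).strip()
        pairs.append((surface, title))
    return pairs


def link_to_dbpedia(title, dbpedia_types):
    """Step 2 (assumed): look up the set of DBpedia classes for an article title.

    dbpedia_types is a hypothetical dict standing in for a DBpedia dump or query.
    """
    return dbpedia_types.get(title.replace(" ", "_"), set())


def map_to_label(classes, class_to_label):
    """Step 3 (assumed): map a set of classes to one label; the first matching rule wins."""
    for dbpedia_class, label in class_to_label:
        if dbpedia_class in classes:
            return label
    return None


def annotate(wikitext, dbpedia_types, class_to_label):
    """Run the three steps and keep only mentions that received a label."""
    annotated = []
    for surface, title in extract_entities(wikitext):
        classes = link_to_dbpedia(title, dbpedia_types)
        label = map_to_label(classes, class_to_label)
        if label is not None:
            annotated.append((surface, label))
    return annotated
```

In this sketch, ordering the mapping rules from most specific to most general class is one simple way to approximate a hierarchical label set. The post-processing step mentioned in the description is not covered here.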