B-NER: A Novel Bangla Named Entity Recognition Dataset With Largest Entities and Its Baseline Evaluation

Authors: Md. Zahidul Haque, Sakib Zaman, Jillur Rahman Saurav, Summit Haque, Md. Saiful Islam, Mohammad Ruhul Amin
Language: English
Year of publication: 2023
Subject:
Source: IEEE Access, Vol 11, Pp 45194-45205 (2023)
Document type: article
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2023.3267746
Description: Within the Natural Language Processing (NLP) framework, Named Entity Recognition (NER) is regarded as the basis for extracting key information to understand texts in any language. As Bangla is a highly inflectional, morphologically rich, and resource-scarce language, building a balanced NER corpus with large and diverse entities is a demanding task. Previously developed Bangla NER systems are limited to recognizing only three familiar entities: person, location, and organization. To address this limitation, we introduce a novel Bangla NER dataset, B-NER, created from 22,144 manually annotated Bangla sentences collected from Bangla newspapers and Bangla Wikipedia. The dataset contains 9,895 unique words, manually categorized into eight entity types: person, organization, event, artifact, time indicator, natural phenomenon, geopolitical entity, and geographical location. Inter-annotator agreement experiments were conducted to validate the quality of the annotations performed by three annotators, resulting in a Kappa score of 0.82. In this paper, we outline the annotation guideline with examples, discuss the properties of the B-NER dataset, and present benchmark evaluations of the dataset. To establish that B-NER is more comprehensive and balanced than other publicly accessible datasets, we conducted cross-dataset modeling and validation, i.e., training an NER model on one dataset and testing it on another, and found that the model trained on B-NER performed best in that setting. Furthermore, we performed exhaustive benchmark evaluations based on Bidirectional LSTM with fastText embeddings and sentence transformer models. Among these models, fine-tuned NR/IndicbnBERT achieved notable results, with a Macro-F1 of 86%. The dataset and baseline results will be made publicly available under a CC-BY 4.0 license in the CoNLL-2002 format to facilitate further research on Bangla NER.
Database: Directory of Open Access Journals
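
The description mentions two technical conventions that a short sketch may help make concrete: the CoNLL-2002 file layout (one token per line with its tag in the last column, and a blank line between sentences) and the Macro-F1 score (the unweighted mean of per-label F1). The Python below is only an illustration under stated assumptions, not the authors' code: the sample tokens, the tag names (`B-PER`, `I-PER`, `B-GPE`), and the helper names `read_conll` and `macro_f1` are hypothetical, and the score is computed over token-level labels for simplicity, whereas the paper's exact evaluation protocol is not given in this record.

```python
def read_conll(text):
    """Parse CoNLL-2002-style text into sentences of (token, tag) pairs.

    Assumes whitespace-separated columns, with the token in the first
    column and the tag in the last; blank lines separate sentences.
    """
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[-1]))
    if current:
        sentences.append(current)
    return sentences


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over two equal-length label lists."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Made-up sample in the CoNLL-2002 layout; the tags are hypothetical.
SAMPLE = "Dhaka B-GPE\nis O\nlarge O\n\nRahim B-PER\nKhan I-PER\nspoke O\n"

sentences = read_conll(SAMPLE)
gold = [tag for sentence in sentences for _, tag in sentence]
# A hypothetical prediction that misses the I-PER token.
pred = ["B-GPE", "O", "O", "B-PER", "O", "O"]
score = macro_f1(gold, pred)
```

Because every label contributes equally to the mean, a missed rare label such as `I-PER` lowers Macro-F1 considerably even when most tokens are tagged correctly. NER benchmarks are often scored at the entity level instead, so a dedicated evaluation library may be preferable for real comparisons.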