Abstract: |
Biomedical named entity recognition is a popular research topic in the biosciences domain, as the number of biomedical articles being published is increasing rapidly. Generic models using machine learning and deep learning techniques have been proposed for extracting these entities in the past; however, there is no clear verdict on which techniques are better or how these generic models perform in a domain-specific big-data scenario. In this paper, we evaluate three baseline models on the most complex BioNLP 2013 Cancer Genetics dataset, which addresses the cancer domain: a classifier ensemble, a bidirectional long short-term memory (Bi-LSTM) model, and a bidirectional encoder representations from transformers (BERT) model. We propose NeRBERT, a domain-specific language model that extends BERTBASE by pre-training on extra biomedical corpora using graphics processing units (GPUs). Experimental results demonstrate the efficacy of NeRBERT, as it outperforms the other three models with F1-score gains of 12.18 pp, 8.59 pp, and 5.43 pp over the ensemble, Bi-LSTM, and BERT models, respectively. GPUs reduce the model training time to less than half. Compared with existing state-of-the-art models, NeRBERT scores 1.57 pp higher than the next best model, emerging as a robust biomedical and cancer-phenotyping NER tagger.