Abstract:
Because of the COVID-19 pandemic, more people have become accustomed to online health consultations (OHC) to reassure themselves about their health conditions or to seek alternative treatment options. An OHC system could use named entity recognition (NER), and specifically its health-focused variant, biomedical NER (BioNER), to extract entities from posting history and make it easier for users to find information. The named entities (NEs) may refer to parts of the human anatomy where the user feels discomfort or to terms describing symptoms of a disease. However, OHC posts, especially user questions, are often informal, sometimes very long, and may contain incorrect medical terms, since most users are not trained medical professionals; this can lead to out-of-vocabulary (OOV) problems. Although the long short-term memory (LSTM) architecture is known for its strength in modeling sequential data such as text, even its bidirectional variant (BiLSTM) has difficulty handling such long sentences, whereas a transformer model can overcome these problems. A further problem is the scarcity of annotated data for low-resource-language OHC texts, despite the abundance of raw data that can be crawled from OHC platforms. To augment the training data, our process includes a self-training approach, a form of semi-supervised learning, in the data preparation stage to improve the BioNER model. In preparing the BioNER model, this work compares embedding strategies, namely stacked embeddings for the BiLSTM-based model versus fine-tuning for the transformer-based model, and defines a pseudo-label filtering step to reduce noise introduced by self-training. Although the empirical experiments used Indonesian OHC texts as a case of a low-resource language because of our familiarity with it, the procedures in this work apply to other Latin-alphabet-based languages. We also examined the creation of other biomedical NER models and used topic modelling to verify the entities extracted by the resulting BioNER model, thereby validating the procedures. The results indicate that our framework, which prepares labelled data from raw texts using self-training with a confidence threshold of 0.85 to build the BioNER model, achieves F1 scores of 0.732 for the BiLSTM-based model and 0.838 for the transformer-based model.
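The abstract mentions filtering pseudo-labels from self-training with a confidence threshold of 0.85. As a rough illustration of that filtering step only (the class and function names below are hypothetical and not taken from the paper), a minimal Python sketch might look like this:

```python
from dataclasses import dataclass
from typing import List

# Hypothetical container for a pseudo-labelled entity span produced by an
# intermediate BioNER model during one self-training round.
@dataclass
class PseudoEntity:
    text: str          # surface form, e.g. "sakit kepala" (headache)
    label: str         # entity type, e.g. "SYMPTOM"
    confidence: float  # model confidence for this span


def filter_pseudo_labels(entities: List[PseudoEntity],
                         threshold: float = 0.85) -> List[PseudoEntity]:
    """Keep only pseudo-labelled spans whose confidence meets the threshold,
    reducing the noise that self-training would otherwise feed back into the
    training data."""
    return [e for e in entities if e.confidence >= threshold]


# Usage: only the high-confidence span survives the filter.
candidates = [
    PseudoEntity("sakit kepala", "SYMPTOM", 0.93),
    PseudoEntity("obat warung", "DRUG", 0.41),
]
print(filter_pseudo_labels(candidates))
```

The surviving spans would then be added to the labelled training set for the next self-training iteration, under the assumption (stated in the abstract) that a 0.85 threshold balances label quality against the amount of added data.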