Machine Learning-based Sequence Labeling for Text Data

Author: Liu, Shifeng
Year of publication: 2020
Subject:
DOI: 10.26190/unsworks/3978
Description: In this thesis, we study the sequence labeling task: given a token sequence, assign the best label from a pre-defined set to each token. For example, named entity recognition (NER) identifies entity mentions in text and classifies them into pre-defined types. Sequence labeling is a prevalent and fundamental task for many applications, such as information retrieval and knowledge base construction. Although various methods have been proposed, pressing challenges remain. Because most methods apply machine learning techniques that require high-quality annotated data for training, obtaining sufficient annotated data is a crucial challenge. Other challenges include the isolation of existing methods from one another. We first address the problem of generating annotated data for NER in specific domains. Currently, most NER methods are supervised or semi-supervised and therefore require human-annotated data. However, human-annotated data is scarce because annotation is labor-intensive and time-consuming. To tackle this challenge, we propose a dictionary extension method with headword-based non-exact matching to generalize distant supervision. To reduce the impact of incorrectly annotated data, we apply a weighted function. We also propose a span-level model with a corresponding dynamic-programming-based inference algorithm. Experiments on all three benchmark datasets, drawn from different domains, demonstrate that our method outperforms previous state-of-the-art distantly supervised methods. Next, motivated by the prediction results of existing methods, we seek to build connections among them. Existing NER methods achieve decent results. However, they were developed independently, and none of them draws on the strengths of the others. We propose a stacking model that exploits the strengths and avoids the weaknesses of existing methods. 
We design meta-features based on the prediction results of existing methods to capture their properties. We also introduce external knowledge for each token. Finally, we demonstrate the superiority of our proposed method over existing methods through extensive experiments. In this thesis, we also extend our understanding of NER to another sequence labeling task, hypernym detection. Hypernym detection is a key step in constructing the ontology during knowledge base construction. Traditional methods detect hypernyms from predicted definition sentences, which leads to error propagation. We instead handle hypernym detection and definition extraction simultaneously. We propose a two-phase method with a joint neural network for both problems in phase I and a refinement model for hypernym extraction in phase II. We carefully design features for the phase II model using the results from phase I. We conduct experiments on a well-known dataset and show the effectiveness of our method.
Database: OpenAIRE