Popis: |
This thesis explores automating the extraction of information from PDFs of Swedish film scripts using various machine learning techniques and named entity recognition (NER). It also explores whether the amount of labeled data needed for the NER tasks can be reduced, with the goal of saving labeling time. The automation process is split into two subsystems: one that extracts larger chunks of text, and one that uses NER to extract relevant information, in the form of named entities, from some of those chunks. The methods explored for reducing labeling time for NER are active learning and self learning. For active learning, three methods are explored: Logprob and Word Entropy as uncertainty-based methods, and active learning by processing surprisal (ALPS) as a diversity-based method. For self learning, Logprob and Word Entropy are used, since they are uncertainty-based sampling methods. The results show that ALPS is the highest-performing active learning method for saving time on labeling data for NER. For self learning, Word Entropy proved successful, whereas Logprob could not be used satisfactorily. The entire script-reading system is evaluated against a human extracting information from a film script, with the human and the system competing on time and accuracy. Accuracy is defined as a custom F1-score based on the F1-score for NER. Overall, the system is orders of magnitude faster than the human while still retaining fairly high accuracy. The subsystem for extracting named entities had quite low accuracy, which is hypothesised to be mainly due to high data imbalance and too little diversity in the training data. Teknisk-naturvetenskapliga fakulteten, Uppsala universitet. Place of publication: Uppsala. Supervisor: Björn Mosten. Subject reviewer: Maria Andrína Fransisco Rodriguez |