Leveraging Lexical-Semantic Knowledge for Text Classification Tasks

Author: Flekova, Lucie
Language: English
Publication year: 2017
Document type: Doctoral Thesis
Description: This dissertation is concerned with the applicability of the knowledge contained in lexical-semantic resources to text classification tasks. Lexical-semantic resources aim at systematically encoding various types of information about the meaning of words and their relations. Text classification is the task of sorting a set of documents into categories from a predefined set, for example, “spam” and “not spam”. With the increasing amount of digitized text and the increased availability of computing power, techniques for automating text classification have attracted rapidly growing interest. The early techniques classified documents using a set of rules manually defined by experts, e.g. computational linguists. The rise of big data led to the increased popularity of the distributional hypothesis, i.e., that “the meaning of a word comes from its context”, and to the criticism of lexical-semantic resources as too academic for real-world NLP applications. For a long time, it was assumed that lexical-semantic knowledge would not lead to better classification results, as the meaning of every word can be learned directly from the document itself. In this thesis, we show that this assumption is not valid as a general statement and present several approaches through which lexicon-based knowledge leads to better results. Moreover, we show why these improved results can be expected. One of the first problems in natural language processing is lexical-semantic ambiguity. In text classification tasks, the ambiguity problem has often been neglected. For example, to classify the topic of a document containing the word “bank”, we do not need to disambiguate it explicitly if we also find the word “river” or “finance”. However, such an additional word may not always be present. Conveniently, lexical-semantic resources typically enumerate all senses of a word, letting us choose which word sense is the most plausible in our context.
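The idea of choosing the most plausible sense from an enumerated sense inventory can be illustrated with a minimal sketch. It uses a simplified gloss-overlap heuristic in the style of the Lesk algorithm, which is swapped in here for illustration and is not necessarily one of the algorithms evaluated in the thesis. The sense inventory `SENSES`, its glosses, the stop-word list, and the function names are all hypothetical.

```python
# Toy sense inventory: hypothetical sense ids and glosses, not taken from any real resource.
SENSES = {
    "bank": {
        "bank.finance": "institution that accepts deposits and lends money",
        "bank.river": "sloping land beside a river or lake",
    }
}

# Small illustrative stop-word list (an assumption, not a standard list).
STOPWORDS = {"a", "an", "and", "or", "the", "that", "of", "to", "in", "on", "at", "we", "by"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def disambiguate(word, context):
    """Return the sense id whose gloss shares the most words with the context.

    Returns None for words missing from the inventory; on a tie, the sense
    listed first in the inventory wins.
    """
    context_words = tokenize(context)
    best_sense, best_overlap = None, -1
    for sense_id, gloss in SENSES.get(word, {}).items():
        overlap = len(context_words & tokenize(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense_id, overlap
    return best_sense


if __name__ == "__main__":
    print(disambiguate("bank", "we walked along the river to the lake"))
    print(disambiguate("bank", "the bank lends money to small firms"))
```

With a real resource, the glosses would come from the resource itself rather than from a hand-written dictionary, and the overlap measure would usually be more refined.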
What if we use knowledge-based sense disambiguation methods in addition to the information provided implicitly by the word context in the document? In this thesis, we evaluate the performance of selected resource-based word sense disambiguation algorithms on a range of document classification tasks (Chapter 3). We note that the lexicographic sense distinctions provided by lexical-semantic resources are not always optimal for every text classification task, and propose an alternative technique for disambiguating word meaning in context for sentiment analysis applications. The second problem in text classification, and in natural language processing in general, is synonymy. The words used in training documents represent only a tiny fraction of the total possible vocabulary. If we learn individual words, or senses, as features in the classification model, our system will not be able to interpret paraphrases, where the same meaning is conveyed using different expressions. How much would the classification performance improve if the system could determine that two very different words represent the same meaning? In this thesis, we propose to address the synonymy problem by automatically enriching the training and testing data with conceptual annotations accessible through lexical-semantic resources (Chapter 4). We show that such conceptual information (“supersenses”), in combination with the previous word sense disambiguation step, helps to build more robust classifiers and improves classification performance on multiple tasks (Chapter 5). We further circumvent the sense disambiguation step by training a supersense tagging model directly.
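The enrichment of training and testing data with conceptual annotations can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the thesis: the mapping `SUPERSENSES` is a hypothetical hand-written table with labels in the style of WordNet supersenses, and the sketch assumes each word has already been resolved to a single sense.

```python
# Hypothetical word-to-supersense map; labels follow the style of WordNet supersenses.
# Assumes sense disambiguation has already happened (here, "duck" is the animal).
SUPERSENSES = {
    "duck": "noun.animal",
    "goose": "noun.animal",
    "soup": "noun.food",
    "bread": "noun.food",
}


def enrich(tokens):
    """Return the original tokens plus one supersense feature per token found in the map."""
    features = list(tokens)
    for tok in tokens:
        label = SUPERSENSES.get(tok.lower())
        if label is not None:
            features.append("SS=" + label)
    return features


if __name__ == "__main__":
    train = enrich(["a", "duck", "swam"])
    test = enrich(["the", "goose", "swam"])
    # Features shared between the two documents, including the supersense feature.
    print(sorted(set(train) & set(test)))
```

Although “duck” and “goose” never co-occur as word features, both documents receive the same supersense feature, which gives a classifier something to generalize over.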
Previous evidence suggests that the sense distinctions of expert lexical-semantic resources are far subtler than what is needed for downstream NLP applications, and by disambiguating the concepts directly on a supersense level (e.g., “is the ‘duck’ an animal or a food?” rather than choosing between its eight WordNet senses), we can reduce the number of errors. The third problem in text classification is the curse of dimensionality. We want to know not only whether each single word predicts a certain document class, but also which combinations of words predict it and which do not. Our need for training data thus grows exponentially with the number of words monitored. Several techniques for dimensionality reduction have been proposed, most recently representation learning, which produces continuous word representations in a dense vector space, also known as word embeddings. However, these vectors are again produced on an ambiguous word level, and the valuable information about possible distinct senses of the same word is lost in favor of the most frequent one(s). In this thesis, we explore whether, and how, we can use lexical-semantic resources to regain the sense-level notion of semantic relatedness while operating within the deep learning paradigm, and thus still access the high-level conceptual information. We propose and evaluate a method to integrate word and supersense embeddings from large sense-disambiguated resources such as Wikipedia. We examine the impact of different training data on the quality of these embeddings, and demonstrate how to employ them in deep learning text classification experiments. Using convolutional and recurrent neural networks, we achieve a significant performance improvement over word embeddings in a range of downstream classification tasks.
The application of the methods proposed in this thesis is demonstrated in experiments that estimate the demographics and personality of a text's author and label a text with its subjective charge and the sentiment it conveys. We therefore also provide empirical insights into which types of features are informative for these document classification problems, and suggest explanations grounded in psychology and sociology. We further discuss the issues that can arise because human experts are prone to diverse biases when classifying data. To summarize, we show that lexical-semantic knowledge can improve text classification by supplying a hierarchy of abstract concepts, which enables better generalization over words, and that these methods are also effective in combination with deep learning techniques.
Database: Networked Digital Library of Theses & Dissertations