Sentiment Analysis for Patient-Author Text: Using Word2Vec and Symptoms

Author: Zong-Yao Wu, 吳宗耀
Year of publication: 2017
Document type: 學位論文 ; thesis
Description: 105
Sentiment analysis (SA) has recently been gaining popularity. Most previous work applied machine learning techniques to product reviews in order to predict sentiment polarity, focusing on how to build patterns such as statistical language models or how to extract semantic features from text. In this work, we apply SA techniques to patient-authored text (PAT) from online medical communities. Our datasets consist of PAT from a well-known medical website, patientslikeme.com (PLM), where patients can share mood phrases, severity of symptoms, treatments, and quality of life. PAT reads more like a diary or journal in which patients reflect on themselves. A further feature unique to the PLM datasets is the discussion of symptoms and diseases, so we also examine the relationship between sentiment polarity and symptoms. Many studies use a bag-of-words representation for document features, but some studies have shown that bag-of-words loses part of each word's meaning. In our study, we explore the possibility of using “word vectors” to represent documents. Word2Vec is a tool for training word vectors; its central idea is that the trained vectors not only help find similar words but also capture multiple levels of meaning. In the first experiment, we used Word2Vec to generate word vectors and used five different methods to generate sentence vectors, including the most commonly used average method, the non-normalization method, the stop-word method, and the sentiment method from the SA domain. We then used two classifiers, a support vector machine (SVM) and k-nearest neighbors (k-NN) with cosine similarity, to classify the sentiment polarity of the PATs. Some previous studies claimed that the corpus used to train the Word2Vec model is very important, so we also examine the effect of corpus composition on the classification results. For the second experiment, we prepared two corpora in order to examine whether corpus quality or corpus volume is more helpful for classification.
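The sentence-vector and k-NN steps described above can be illustrated with a minimal sketch. It is not the thesis implementation: the three-dimensional word vectors, the stop-word list, the toy training sentences, and the function names (`sentence_vector`, `cosine_similarity`, `knn_predict`) are all hypothetical, and it assumes that "normalization" means dividing the summed word vectors by the number of words, which the abstract does not spell out.

```python
import math
from collections import Counter

# Hypothetical toy vectors standing in for a trained Word2Vec model.
WORD_VECTORS = {
    "pain": [-0.9, 0.1, 0.3],
    "tired": [-0.7, 0.2, 0.1],
    "worse": [-0.8, -0.1, 0.2],
    "happy": [0.9, 0.2, 0.1],
    "better": [0.8, 0.1, 0.3],
    "walk": [0.3, 0.6, 0.2],
    "today": [0.0, 0.1, 0.9],
}
# Hypothetical stop-word list for the stop-word variant.
STOP_WORDS = {"i", "am", "the", "a", "is", "my", "feel", "today"}
DIM = 3


def sentence_vector(tokens, normalize=True, drop_stop_words=False):
    """Sum the vectors of known tokens; divide by their count if normalize."""
    if drop_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    known = [WORD_VECTORS[t] for t in tokens if t in WORD_VECTORS]
    if not known:
        return [0.0] * DIM
    total = [sum(dim) for dim in zip(*known)]
    if normalize:
        return [x / len(known) for x in total]  # average method
    return total  # assumed non-normalization method: plain sum


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(query, training, k=3):
    """Majority label among the k training vectors most similar to query."""
    ranked = sorted(
        training,
        key=lambda item: cosine_similarity(query, item[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy labeled sentences (invented for illustration only).
training = [
    (sentence_vector(["pain", "worse"]), "negative"),
    (sentence_vector(["tired", "pain"]), "negative"),
    (sentence_vector(["happy", "better"]), "positive"),
    (sentence_vector(["better", "walk"]), "positive"),
]
query = sentence_vector(["i", "feel", "tired", "today"], drop_stop_words=True)
prediction = knn_predict(query, training, k=3)
print(prediction)
```

Under this assumed definition, the averaged and summed vectors of a sentence point in the same direction, so cosine similarity cannot tell them apart; the choice would only affect a length-sensitive classifier such as an SVM.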
From past studies, we observed that “PATs with reference to symptoms” have a large effect on classification. Our observation shows that negative polarity and reference to symptoms are highly correlated. Therefore, we build another training model based on this observation and evaluate the results. The results show that the non-normalization method is best at identifying positive polarity, while the sentiment method is best at identifying negative polarity. We also found that the normalization method produced worse classification results than the non-normalization method. In the second experiment, we used two different types of classifiers, i.e., SVM and k-NN. All results showed that the Word2Vec model trained on medical corpora yielded better classification performance than the one trained on the Wikipedia corpus. This outcome indicates that, when training Word2Vec models, the quality of the training corpus matters more than its volume. In the future, we wish to further explore the use of explicit and implicit references to symptoms in the PATs.
Database: Networked Digital Library of Theses & Dissertations