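A Study on Applying Sentiment Analysis of Financial Texts to Trend Prediction of the Taiwan Electronics Sector Stock Index (運用財經文本情感分析於台灣電子類股價指數趨勢預測之研究)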

Author: 劉羿廷
Subject:
Document type: Text
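Description: The electronics industry is Taiwan's most competitive industry, and electronics stocks account for as much as 69.49% of trading on the centralized market, so fluctuations in electronics stocks are enough to have a substantial impact on the entire Taiwan stock market. Many studies point out that textual information on the Internet spreads rapidly, catalyzed by social networks, and influences public sentiment, which in turn affects stock price movements. For investors, being able to quickly analyze large volumes of online financial text, infer investor sentiment, and thereby predict stock price trends could therefore improve returns. However, close to a hundred financial texts are produced every day, and traditional manual sampling analysis is inefficient and too labor-intensive to handle this volume of data.
Past research on text sentiment analysis has shown that supervised learning methods can achieve good classification results through simple quantification. However, the training data used by supervised learning must have predefined, known classes, so these methods cannot anticipate unknown classes and thus cannot identify unknown topics that may exist in the texts. This study therefore proposes a sentiment analysis method for financial texts that combines supervised and unsupervised learning. Unsupervised learning is applied to financial texts on the electronics industry from the whole of 2014 to identify text topics, compute sentiment indices, and label sentiment orientation. Trend line charts are then analyzed with visualization tools to find topics with leading-indicator characteristics. Supervised learning is then used to combine these with international indicators, macroeconomic indicators, Taiwan stock market indicators, and technical indicators to build a classification model that predicts the trend of the Taiwan electronics sector stock index.
In the experimental results on topic labeling, this study found that because the number of texts far exceeds the number of topic terms, the TFIDF matrix is too sparse, so the TFIDF-Kmeans topic model classifies poorly. Because texts cover multiple topics, the topic terms clustered by NPMI-Concor are too complex to summarize easily. By contrast, because all topics are shared by all documents in the LDA topic model, it outperforms the TFIDF-Kmeans and NPMI-Concor topic models in both term clustering and topic classification accuracy, reaching a classification accuracy of 98%, so LDA was adopted for subsequent topic labeling. On sentiment orientation labeling, the sentiment lexicon expanded in this study was shown to judge word polarity better than NTUSD, and the trend line of the computed sentiment index matches the trend of the electronics sector stock index more closely than the trend line of MACD, which investors commonly use. In addition, not all texts' sentiment indices have leading characteristics: only the sentiment indices of texts on the enterprise operation and macroeconomic topics reflect the trend of the electronics sector stock index in advance, so this study uses the sentiment indices of texts on these two topics to build the classification model.
Next, by comparing the accuracy of a classification model that combines the sentiment index with technical indicators against one that uses technical indicators alone, this study found that the former is 7% more accurate than the latter. A classification model that further incorporates indirect sentiment indicators reaches an accuracy of 71%. This confirms that sentiment analysis can effectively improve the accuracy of trend prediction for the electronics sector stock index and thereby improve investors' return on investment.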
The electronics industry is the most competitive industry in Taiwan, and its large trading volume can strongly influence the whole stock market. Many studies show that text documents on the Internet have a great effect on public emotion, and that public emotion can in turn affect stock prices. For investors, it is important to know how to analyze the latent emotion in text documents and then use this information to predict stock trends. However, the traditional approach of analyzing text documents manually cannot cope with the large volume of financial text documents on the Internet. In past sentiment analysis research, supervised methods have been shown to reach high accuracy, but they have limits in predicting future trends. This research proposes a solution that mixes supervised and unsupervised methods to handle these large volumes of financial text documents. First, we use an unsupervised method to identify the topics of the documents, and then calculate a sentiment index to judge each document's emotional direction. After that, we produce trend line charts with visualization tools to find out which topics' document sentiment indices are leading indicators. Furthermore, we use a supervised method to integrate the sentiment index with 24 other indirect sentiment indices to build the prediction model. According to the results, the LDA model performs better than the TFIDF-Kmeans model and the NPMI-Concor model because of the characteristics of the documents. In addition, the sentiment dictionary built in this research judges word polarity more accurately than NTUSD. The trend of the sentiment index is more similar to that of the Taiwan electronic sub-index (TE) than the MACD line is. We also find that the sentiment indices produced from documents about enterprise operation and macroeconomics are leading indicators, so we use these to build the prediction model.
Moreover, we found that the prediction model that includes the sentiment index performs better than the one that includes only the technical indicators. As mentioned above, the sentiment index can make the prediction of the Taiwan electronic sub-index trend more accurate and improve the return on investment.
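The abstract does not state how the sentiment index is computed. As a minimal sketch, assuming a simple lexicon-count ratio, (positive - negative) / (positive + negative), averaged per day over documents from the two leading topics, the idea might look like the following. The word lists, topic labels, sample documents, and function names are all hypothetical placeholders and are not taken from the thesis or from NTUSD.

```python
from collections import defaultdict

# Hypothetical toy lexicon; the thesis's expanded lexicon is not reproduced here.
POSITIVE = {"growth", "profit", "record", "strong"}
NEGATIVE = {"loss", "decline", "weak", "layoff"}


def sentiment_index(tokens):
    """Assumed formula: (pos - neg) / (pos + neg), in [-1, 1].

    Returns 0.0 when the document contains no lexicon word.
    """
    pos = sum(1 for t in tokens if t in POSITIVE)
    neg = sum(1 for t in tokens if t in NEGATIVE)
    total = pos + neg
    return (pos - neg) / total if total else 0.0


def daily_index(docs, topics):
    """Average document-level scores per date, keeping only the given topics.

    docs is an iterable of (date, topic, tokens) tuples.
    """
    buckets = defaultdict(list)
    for date, topic, tokens in docs:
        if topic in topics:
            buckets[date].append(sentiment_index(tokens))
    return {d: sum(s) / len(s) for d, s in sorted(buckets.items())}


# Hypothetical, already-tokenized documents with topic labels assigned upstream.
docs = [
    ("2014-01-02", "enterprise_operation", ["record", "profit", "growth"]),
    ("2014-01-02", "macroeconomics", ["weak", "growth"]),
    ("2014-01-02", "other", ["loss", "loss"]),
    ("2014-01-03", "enterprise_operation", ["loss", "decline", "strong"]),
]
result = daily_index(docs, {"enterprise_operation", "macroeconomics"})
print(result)
```

In this sketch the topic label is taken as given. In the workflow the abstract describes, it would come from the LDA topic model, and the resulting daily series would be one input feature alongside the technical and indirect indicators.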
Database: Networked Digital Library of Theses & Dissertations