Using Exploratory Data Analysis and Support Vector Machine to Build Media Classifiers on Sport News

Author: Chu, Cheng-Wei, 褚承威
Publication year: 2018
Document type: 學位論文 ; thesis
Description: 106
News is a report that describes the state of a problem, event, or process at a given time. In the past, newspapers were the most common medium for spreading news. As the Internet and social media have grown rapidly, people’s habits have changed, and a majority of people now prefer to read digital news rather than news on paper. This study aims to develop a classifier that predicts which newspaper published a given digital news article. Over four thousand sports news articles published in December 2017 by the four major Taiwanese newspapers (United Daily News, Apple Daily, China Times, and Liberty Times) are collected as training data. An item of digital news typically consists of a title, text content, and photos. Hence, the first and essential step of the analysis is to quantify input variables (features) from the available information in each news item. Moreover, to explore the routine of each newspaper and to improve computational efficiency, an initial exploratory data analysis (EDA) is conducted on the input variables, and the relatively important variables are selected for classifier development. For the text data, term frequency-inverse document frequency (TF-IDF) is applied as a keyword selection method. We then use the selected variables to build newspaper classifiers with a support vector machine (SVM). We find that a simple classifier based on 19 non-text input variables can achieve high accuracy; among these, the image dimensions are the most critical variables. On the other hand, when only text information is considered, we observe that a few text variables can yield excellent classification results.
Database: Networked Digital Library of Theses & Dissertations
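
The abstract names TF-IDF as the keyword selection method for the text data but does not give the exact weighting used. As a rough illustration only, the sketch below scores terms with one common variant (relative term frequency multiplied by an unsmoothed log(N/df)) and keeps the highest-scoring terms. The function names `tfidf_scores` and `top_keywords`, the ranking rule, and the toy English tokens are all assumptions made for this example; they are not taken from the thesis, which works on Chinese-language news and would need a word segmentation step first.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: tf-idf score} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        scores.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(docs, k):
    """Rank terms by their highest score in any document and keep the top k."""
    best = {}
    for doc_scores in tfidf_scores(docs):
        for term, score in doc_scores.items():
            best[term] = max(best.get(term, 0.0), score)
    # Sort by descending score; break ties alphabetically for stable output.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical, already tokenized toy articles (not data from the thesis).
docs = [
    ["team", "wins", "game", "photo"],
    ["team", "loses", "game"],
    ["pitcher", "pitcher", "team", "strikeout"],
]

keywords = top_keywords(docs, 2)
```

A term that occurs in every document, such as "team" here, gets a score of zero under this variant, which is why TF-IDF tends to surface publisher-specific wording instead of words common to all sports coverage. The selected keywords could then serve as input variables for an SVM, as the abstract describes.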