HTwitt: A Hadoop-based platform for analysis and visualization of streaming Twitter data

Author:	Umit Demirbaga
Contributors:	Bartın Üniversitesi, Mühendislik Mimarlık ve Tasarım Fakültesi, Bilgisayar Mühendisliği Bölümü
Language:	English
Year of publication:	2021
Subject:	Monitoring; Computer science; Big data; Cloud computing; 02 engineering and technology; Overfitting; Machine learning; computer.software_genre; Naive Bayes classifier; Artificial intelligence; 020204 information systems; 0202 electrical engineering, electronic engineering, information engineering; MapReduce; Visualization; business.industry; 020206 networking & telecommunications; Virtualization; Classification; Domain knowledge; Data pre-processing; business; computer; Software
Source:	NEURAL COMPUTING & APPLICATIONS
Description:	Twitter's popularity generates a massive volume of data, which is one source of big data problems. One such problem is tweet classification: tweets use sophisticated and complex language that current tools handle poorly. We present HTwitt, a framework built on top of the Hadoop ecosystem that combines a MapReduce algorithm with a set of machine learning techniques embedded in a big data analytics platform. It addresses four problems: (1) traditional data processing techniques cannot handle big data; (2) data preprocessing requires substantial manual effort; (3) domain knowledge is required before classification; (4) semantic explanation is ignored. HTwitt overcomes these challenges by combining several algorithms with a Naïve Bayes classifier to provide reliable and highly precise recommendations in virtualization and cloud environments. These features distinguish HTwitt from other systems by giving it an effective and practical design for text classification in big data analytics. The main contribution of the paper is a framework for building landslide early warning systems that pinpoints useful tweets and visualizes them along with the processed information. We report experiments that quantify the level of overfitting in the training stage of the model using real-world datasets of different sizes. The results show that the proposed system achieves high-quality results, with a score of nearly 95%, and meets the requirements of a Hadoop-based classification system.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::b600a8db0d9497685729a6618ef406ce http://hdl.handle.net/11772/6651
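
The description mentions a MapReduce algorithm combined with a Naïve Bayes classifier for tweet classification. The record does not give the implementation, so the following is only a minimal, single-process sketch of that general idea, not HTwitt's actual code. It assumes a multinomial Naïve Bayes model with add-one smoothing whose counts are gathered in a map step and a reduce step. All function names (`tokenize`, `map_phase`, `reduce_phase`, `train`, `classify`), the labels, and the example tweets are hypothetical and invented for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()


def map_phase(labeled_tweets):
    """Map step: emit (key, 1) pairs for documents and tokens per label."""
    for text, label in labeled_tweets:
        yield ("doc", label), 1
        for token in tokenize(text):
            yield ("tok", label, token), 1


def reduce_phase(pairs):
    """Reduce step: sum the emitted counts for each key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return totals


def train(labeled_tweets):
    """Build per-label document counts, token counts, and the vocabulary."""
    totals = reduce_phase(map_phase(labeled_tweets))
    doc_counts = Counter()
    token_counts = defaultdict(Counter)
    vocab = set()
    for key, count in totals.items():
        if key[0] == "doc":
            doc_counts[key[1]] = count
        else:
            _, label, token = key
            token_counts[label][token] = count
            vocab.add(token)
    return doc_counts, token_counts, vocab


def classify(text, model):
    """Return the label with the highest log posterior (add-one smoothing)."""
    doc_counts, token_counts, vocab = model
    n_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        score = math.log(doc_counts[label] / n_docs)
        label_total = sum(token_counts[label].values())
        for token in tokenize(text):
            if token not in vocab:
                continue  # ignore tokens never seen during training
            count = token_counts[label][token]
            score += math.log((count + 1) / (label_total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical training data, invented for this sketch.
EXAMPLE_TWEETS = [
    ("landslide blocks mountain road after heavy rain", "relevant"),
    ("heavy rain triggers landslide near village", "relevant"),
    ("election ends in landslide victory", "irrelevant"),
    ("great victory for the home team", "irrelevant"),
]
```

A call such as `classify("landslide after heavy rain on the road", train(EXAMPLE_TWEETS))` returns one of the two labels. On a Hadoop cluster the map and reduce steps would run as distributed tasks rather than as in-process generators.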