PoN: Open source solution for real-time data analysis

Authors: Dong-Ryeol Shin, Nikitha Johnsirani Venkatesan, Earl Kim
Publication year: 2016
Subject:
Source: 2016 Third International Conference on Digital Information Processing, Data Mining, and Wireless Communications (DIPDMWC).
DOI: 10.1109/dipdmwc.2016.7529409
Description: With rapid innovation and a growing Internet population, petabytes of information are generated every second. Processing and analysing such enormous volumes of data is a tedious process. The amount of real-time data is growing tremendously, and nearly 80% of it is unstructured. Analysing unstructured data in real time is a very challenging task. Traditional business intelligence (BI) tools perform best only with a pre-defined schema, but most real-time data are logs and have no defined schema. Querying these large datasets takes a long time. During streaming of real-time data, much unwanted information is extracted from the data source, causing overhead in the system and increasing the cost of construction and maintenance. Every second, new data streams about what is going on in the world keep accumulating in the system. Gathering and processing these data is an essential skill for preparing a vital report. In this paper, we propose Piece of News (PoN), an end-to-end solution that uses appropriate Hadoop components for real-time data analytics. Our aim is to extract health data from general news data so that real-time outbreaks can be predicted immediately. Rather than collecting all the news, we keep only the important news based on a certain threshold, thus reducing the cost. We compare historical data with real-time data, which enables prompt action because the outbreaks in the previous data are already known. Going one step further, dangerous outbreaks can even be detected before anyone else in the world detects them. In addition to real-time analytics using Hadoop components, we ran queries over the collected news dataset using Hive and Pig. Finally, we present their performance comparison.
Database: OpenAIRE