Enabling Real Time Analytics over Raw XML Data

Authors: Prashant Agarwal, Manoj K. Agarwal, Krithi Ramamritham
Publication year: 2019
Subject:
Source: Real-Time Business Intelligence and Analytics ISBN: 9783030241230
BIRTE (Revised Selected Papers)
DOI: 10.1007/978-3-030-24124-7_8
Description: The data generated by many applications is in a semi-structured format, such as XML. This data can be used for analytics only after shredding it and storing it in a structured format, a process known as Extract-Transform-Load (ETL). However, the ETL process is often time-consuming, so crucial time-sensitive insights can be lost or become unactionable. Hence, this paper poses the following question: How do we expose analytical insights in the raw XML data? We address this novel problem by discovering additional information, called complementary information (CI), from the raw semi-structured data repository for a given user query. Experiments with real as well as synthetic data show that the discovered CI is relevant in the context of the given user query, nontrivial, and has high precision. The recall is also found to be high for most queries. Crowd-sourced feedback on the discovered CI corroborates these findings, showing that our system is able to discover highly relevant and potentially useful CI in real-world XML data repositories. The concepts behind our technique are generic and can be used for other semi-structured data formats as well.
Database: OpenAIRE