ClimateSpark: An in-memory distributed computing framework for big climate data analytics

Authors:	Fei Hu, M. K. Bowen, Chaowei Yang, Weiwei Song, John L. Schnase, Daniel Duffy, Tsengdar Lee, Mengchao Xu
Publication year:	2018
Subject:	SQL; 010504 meteorology & atmospheric sciences; Computer science; Data management; Distributed computing; Big data; 0211 other engineering and technologies; Cloud computing; 02 engineering and technology; Data structure; 01 natural sciences; Analytics; Data analysis; Climate model; Computers in Earth Sciences; 021101 geological & geomatics engineering; 0105 earth and related environmental sciences; Information Systems
Source:	Computers & Geosciences. 115:154-166
ISSN:	0098-3004
DOI:	10.1016/j.cageo.2018.03.011
Description:	The unprecedented growth of climate data creates new opportunities for climate studies, yet big climate data pose a grand challenge to climatologists, who must manage and analyze them efficiently. The complexity of climate data content and analytical algorithms increases the difficulty of implementing algorithms on high performance computing systems. This paper proposes an in-memory, distributed computing framework, ClimateSpark, to facilitate complex big data analytics and time-consuming computational tasks. A chunking data structure improves parallel I/O efficiency, while a spatiotemporal index is built for the chunks to avoid unnecessary data reading and preprocessing. An integrated, multi-dimensional, array-based data model (ClimateRDD) and ETL operations are developed to address big climate data variety by integrating the processing components of the climate data lifecycle. ClimateSpark utilizes Spark SQL and Apache Zeppelin to develop a web portal that facilitates interaction among climatologists, climate data, analytic operations, and computing resources (e.g., using SQL queries and Scala/Python notebooks). Experimental results show that ClimateSpark conducts different spatiotemporal data queries/analytics with high efficiency and data locality. ClimateSpark is easily adaptable to other big multi-dimensional, array-based datasets in various geoscience domains.
Database:	OpenAIRE
External links:	https://explore.openaire.eu/search/publication?articleId=doi_________::8ba0a923a4b2105bd08fa9a9a25e4de6 https://doi.org/10.1016/j.cageo.2018.03.011
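
Illustrative note: the abstract states that a spatiotemporal index over data chunks avoids unnecessary data reading. The Python sketch below shows one way such pruning could work in principle. It is not ClimateSpark's implementation: the `ChunkEntry` fields, the `select_chunks` function, the variable names, and the file names are all hypothetical, and a linear scan stands in for whatever index structure the paper actually uses.

```python
from dataclasses import dataclass
from typing import List, Tuple

Range = Tuple[float, float]  # inclusive (low, high) bounds


@dataclass(frozen=True)
class ChunkEntry:
    """Hypothetical index record describing one chunk of a gridded variable."""
    variable: str
    time_range: Range   # time steps covered by the chunk
    lat_range: Range    # degrees north
    lon_range: Range    # degrees east
    file_path: str      # file that stores the chunk (made-up names below)
    byte_offset: int    # where the chunk starts inside the file


def overlaps(a: Range, b: Range) -> bool:
    """Return True when two inclusive ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def select_chunks(index: List[ChunkEntry], variable: str,
                  time_range: Range, lat_range: Range,
                  lon_range: Range) -> List[ChunkEntry]:
    """Keep only chunks whose bounding box intersects the query box.

    Chunks that fail the test never need to be read from storage.
    """
    return [
        c for c in index
        if c.variable == variable
        and overlaps(c.time_range, time_range)
        and overlaps(c.lat_range, lat_range)
        and overlaps(c.lon_range, lon_range)
    ]


# Made-up index entries for illustration only.
index = [
    ChunkEntry("T2M", (0, 23), (-90.0, 0.0), (-180.0, 0.0), "day1.nc", 0),
    ChunkEntry("T2M", (0, 23), (0.0, 90.0), (-180.0, 0.0), "day1.nc", 4096),
    ChunkEntry("T2M", (24, 47), (0.0, 90.0), (-180.0, 0.0), "day2.nc", 0),
    ChunkEntry("PS", (0, 23), (0.0, 90.0), (-180.0, 0.0), "day1.nc", 8192),
]

# Query: variable T2M, time steps 10 to 20, a box in the northern
# and western hemispheres.
hits = select_chunks(index, "T2M", (10, 20), (30.0, 50.0), (-125.0, -65.0))
for chunk in hits:
    print(chunk.file_path, chunk.byte_offset)
```

In this toy index, only one of the four chunks intersects the query box, so a reader following the index would fetch a single chunk rather than scanning every file.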