Abstract: |
Spatiotemporal clustering of data is an important analytical technique with many applications in air quality, including source identification, monitoring network analysis, and airshed partitioning. Hierarchical agglomerative clustering is one such algorithm, in which sets of input data are grouped by a chosen similarity metric, without requiring any a priori information about the final arrangement of clusters. Modern implementations of the algorithm have O(n² log(n)) computational complexity and O(n²) memory usage, where n is the number of initial clusters. This dependence can strain the resources of even very large individual computers as the number of initial clusters increases into the tens or hundreds of thousands, for example, when clustering all the points in an air-quality model's simulation grid as part of airshed analysis (~10⁵ to 10⁶ time series to be clustered). Using two parallelization techniques, the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP), we have reduced the wallclock time while increasing the memory available to a new hierarchical clustering program, by dividing the program into blocks which run on separate CPUs but communicate with each other to produce a single result. The new algorithm opens up directions for large-data analysis which had previously not been possible. Here we present a massively parallelized version of an agglomerative hierarchical clustering algorithm which is able to cluster an entire year of hourly regional air-quality model output (538×540 domain; 290,520 hourly concentration time series) in 12 hours 37 minutes of wallclock time, by spreading the computation across 8000 Intel® Xeon® Platinum 8830 CPU cores with a total of 2 TB of RAM. We then show how the new algorithm allows a new form of air-quality analysis to be carried out starting from air-quality model output.
We present maps of the different airsheds within the model domain, identifying distinct regions for each chemical species. These regions can be used as an aid in determining the placement of surface air-quality monitors that gives the most representative sampling for a fixed number of monitors, or the number of monitors required for a given level of similarity between airsheds. We then demonstrate the new algorithm's application to source apportionment of very large observational data sets, through the analysis of a year of Canada's hourly National Air Pollution Surveillance Program data, comprising 366,427 original observation vectors, a problem size that would be impossible with other source apportionment programs such as Positive Matrix Factorization.
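The clustering technique the abstract describes can be illustrated at toy scale. The sketch below is a minimal single-machine analogue using SciPy's generic agglomerative hierarchical clustering, not the authors' MPI/OpenMP implementation; the synthetic "concentration" time series and the choice of correlation distance with average linkage are illustrative assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Six synthetic hourly "concentration" time series over one week:
# three noisy copies of each of two underlying diurnal-like signals,
# standing in for grid-point time series from a model domain.
rng = np.random.default_rng(0)
hours = 24 * 7
t = np.linspace(0, 14 * np.pi, hours)
base_a, base_b = np.sin(t), np.cos(t)
series = np.vstack(
    [base_a + 0.1 * rng.standard_normal(hours) for _ in range(3)]
    + [base_b + 0.1 * rng.standard_normal(hours) for _ in range(3)]
)

# Agglomerative hierarchical clustering: each series starts as its own
# cluster, and the two most similar clusters (here, 1 - Pearson
# correlation, average linkage) are merged repeatedly.
Z = linkage(series, method="average", metric="correlation")

# Cut the dendrogram into two clusters ("airsheds" in this analogy).
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)
```

The two groups of series recover the two underlying signals; at the scale reported in the abstract (hundreds of thousands of series), the O(n²) distance matrix alone is why the computation must be distributed across many nodes.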