Clustering Analysis of Very Large Measurement and Model Data Sets on High‐Performance Computing Platforms

Publication details

Journal: Journal of Geophysical Research (JGR): Atmospheres, vol. 131, e2025JD043822, 2026

DOI: doi.org/10.1029/2025jd043822
Archive: hdl.handle.net/11250/5342335

Abstract:
Hierarchical agglomerative clustering is a useful analysis technique that allows for a level of stability, interpretability and flexibility not available in other similar techniques such as K‐means, density‐based clustering or positive matrix factorization. Previous studies using hierarchical clustering on atmospheric model output have been limited to small domain sizes (roughly 100 × 100 grid cells) by the computational expense and memory requirements of the algorithm. Here we present a scalable hierarchical clustering implementation that we apply to two year‐long, hourly atmospheric data sets: model concentration and deposition time series at 290,520 locations over Alberta and Saskatchewan (a 538 × 540 grid); and 366,427 multi‐pollutant observations from 51 national air pollution surveillance stations located across Canada. When combined with other information such as emissions source locations, orography, and prevailing meteorological conditions, the method yields coherent, interpretable structures. In the case of the model time series, the clustering provides regions of similar air quality (airsheds), which can be used to inform air quality monitoring network placement, or regions of similar deposition, which can inform critical load assessment as well as monitoring site locations. In the case of the multi‐pollutant observations, we show that a single low‐primary pollutant cluster appears most frequently at all but one of the 51 stations across Canada, accounting for 62% of all station‐hours, while elevated SO₂ appears in factor profiles at certain monitoring locations near industrial and shipping activity. Together, these results demonstrate that hierarchical clustering can efficiently summarize patterns relevant to airshed mapping and source apportionment at previously unreachable scales.
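
To make the basic idea concrete, the following is a minimal sketch of agglomerative clustering applied to a handful of synthetic hourly time series. It is not the scalable implementation described in the abstract: it is a naive version whose cost grows roughly with the cube of the number of series, which illustrates why earlier studies were limited to small domains. The abstract does not state the linkage rule or the dissimilarity measure used, so average linkage and a dissimilarity of 1 minus the Pearson correlation are assumptions made here, and the function names `correlation_distance` and `agglomerative_clusters` are hypothetical.

```python
import math


def correlation_distance(a, b):
    """Return 1 - Pearson correlation of two equal-length series (assumed metric)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)


def agglomerative_clusters(series, n_clusters):
    """Naive average-linkage agglomerative clustering (illustrative only).

    Starts with one cluster per series and repeatedly merges the two
    clusters with the smallest mean pairwise distance until only
    n_clusters remain. Returns a list of sorted index lists.
    """
    n = len(series)
    # Full pairwise distance matrix: memory grows with n squared.
    dist = [[correlation_distance(series[i], series[j]) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_clusters:
        best = None
        for p in range(len(clusters)):
            for q in range(p + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[p] for j in clusters[q])
                d = total / (len(clusters[p]) * len(clusters[q]))
                if best is None or d < best[0]:
                    best = (d, p, q)
        _, p, q = best
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]
    return [sorted(c) for c in clusters]


# Synthetic 24-hour series at four hypothetical sites: sites 0 and 1 share
# one diurnal shape, sites 2 and 3 share another (offsets and scales differ).
hours = range(24)
series = [
    [math.sin(2 * math.pi * h / 24) for h in hours],
    [2.0 * math.sin(2 * math.pi * h / 24) + 1.0 for h in hours],
    [math.cos(2 * math.pi * h / 24) for h in hours],
    [0.5 * math.cos(2 * math.pi * h / 24) - 3.0 for h in hours],
]
clusters = agglomerative_clusters(series, n_clusters=2)
print(clusters)
```

Because the correlation-based dissimilarity ignores offset and scale, the sketch groups sites by the shape of their time series rather than by absolute magnitude. Whether that matches the choice made in the paper is not stated in the abstract.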