Advanced Techniques for High-Performance Fock Matrix Construction on GPU Clusters.

Authors: Palethorpe E; School of Computing, Australian National University, Canberra, ACT 2601, Australia., Stocks R; School of Computing, Australian National University, Canberra, ACT 2601, Australia., Barca GMJ; School of Computing and Information Systems, Melbourne University, Melbourne, VIC 3052, Australia.
Language: English
Source: Journal of chemical theory and computation [J Chem Theory Comput] 2024 Dec 10; Vol. 20 (23), pp. 10424-10442. Date of Electronic Publication: 2024 Nov 25.
DOI: 10.1021/acs.jctc.4c00994
Abstract: This Article presents two optimized multi-GPU algorithms for Fock matrix construction, building on the work of Ufimtsev and Martinez [J. Chem. Theory Comput. 2009, 5, 1004-1015] and Barca et al. [J. Chem. Theory Comput. 2021, 17, 7486-7503]. The novel algorithms, opt-UM and opt-Brc, introduce significant enhancements, including improved integral screening, exploitation of sparsity and symmetry, a linear scaling exchange matrix assembly algorithm, and extended capabilities for Hartree-Fock calculations up to f-type angular momentum functions. Opt-Brc excels for smaller systems and for highly contracted triple-ζ basis sets, while opt-UM is advantageous for large molecular systems. Performance benchmarks on NVIDIA A100 GPUs show that our algorithms in the EXtreme-scale Electronic Structure System (EXESS), when combined, outperform all current GPU and CPU Fock build implementations in TeraChem, QUICK, GPU4PySCF, LibIntX, ORCA, and Q-Chem. The implementations were benchmarked on linear and globular systems, and average speedups across three double-ζ basis sets of 1.4×, 8.4×, and 9.4× were observed compared to TeraChem, QUICK, and GPU4PySCF, respectively. An increased average speedup of 2.1× over TeraChem is observed when using four A100 GPUs. Strong scaling analysis reveals over 91% parallel efficiency on four GPUs for opt-Brc, making it typically faster for multi-GPU execution. Single-compute-node comparisons with CPU-based software like ORCA and Q-Chem show speedups of up to 42× and 31×, respectively, enhancing power efficiency by up to 18×.
Database: MEDLINE