Parallel Implementation of the Density Matrix Renormalization Group Method Achieving a Quarter petaFLOPS Performance on a Single DGX-H100 GPU Node.

Authors: Menczer A; Strongly Correlated Systems Lendület Research Group, Wigner Research Centre for Physics, H-1525 Budapest, Hungary.; Eötvös Loránd University, Pázmány Péter Sétány 1/C, 1117 Budapest, Hungary., van Damme M; SandboxAQ, 780 High Street, Palo Alto, California 94301, United States., Rask A; SandboxAQ, 780 High Street, Palo Alto, California 94301, United States., Huntington L; SandboxAQ, 780 High Street, Palo Alto, California 94301, United States., Hammond J; NVIDIA Helsinki Oy, Porkkalankatu 1, 00180 Helsinki, Finland., Xantheas SS; Advanced Computing, Mathematics, and Data Division, Pacific Northwest National Laboratory, Richland, Washington 99354, United States.; Department of Chemistry, University of Washington, Seattle, Washington 98195, United States., Ganahl M; SandboxAQ, 780 High Street, Palo Alto, California 94301, United States., Legeza Ö; Strongly Correlated Systems Lendület Research Group, Wigner Research Centre for Physics, H-1525 Budapest, Hungary.; Dynaflex Ltd., Zrínyi u 7, 1028 Budapest, Hungary.; Institute for Advanced Study, Technical University of Munich, Germany, Lichtenbergstrasse 2a, 85748 Garching, Germany.; Parmenides Stiftung, Hindenburgstr. 15, 82343 Pöcking, Germany.
Language: English
Source: Journal of chemical theory and computation [J Chem Theory Comput] 2024 Oct 08; Vol. 20 (19), pp. 8397-8404. Date of Electronic Publication: 2024 Sep 19.
DOI: 10.1021/acs.jctc.4c00903
Abstract: We report cutting-edge performance results on a single-node hybrid CPU-multi-GPU implementation of the spin-adapted ab initio Density Matrix Renormalization Group (DMRG) method on current state-of-the-art NVIDIA DGX-H100 architectures. We evaluate the performance of the DMRG electronic structure calculations for the active compounds of the FeMoco, the primary cofactor of nitrogenase, and cytochrome P450 (CYP) enzymes with complete active space (CAS) sizes of up to 113 electrons in 76 orbitals [CAS(113, 76)] and 63 electrons in 58 orbitals [CAS(63, 58)], respectively. We achieve 246 teraFLOPS of sustained performance, an improvement of more than 2.5× compared to the performance achieved on the DGX-A100 architectures and an 80× acceleration compared to an OpenMP-parallelized implementation on a 128-core CPU architecture. Our work highlights the ability of tensor network algorithms to efficiently utilize high-performance multi-GPU hardware and shows that the combination of tensor networks with modern large-scale GPU accelerators can pave the way toward solving some of the most challenging problems in quantum chemistry and beyond.
Database: MEDLINE