Effectively Exploiting Parallel Scale for All Problem Sizes in LU Factorization

Authors:	R. Clint Whaley, Rakib Hasan
Publication year:	2014
Subject:	Xeon; Parallel processing (DSP implementation); Computer science; Clock rate; Parallel algorithm; Cache; Parallel computing; Operand; LU decomposition; Matrix decomposition
Source:	IPDPS
DOI:	10.1109/ipdps.2014.109
Description:	LU factorization is one of the most widely used methods for solving linear equations, and thus its performance underlies a broad range of scientific computing. As architectural trends have replaced clock rate improvements with increases in parallel scale, library writers have responded by using tiled algorithms, where operand size is constrained in order to maximize parallelism, as seen in the well-known PLASMA library. This approach has two main drawbacks: (1) asymptotic performance is reduced due to limited operand size, and (2) performance of small- to medium-sized problems is reduced due to unnecessary data motion in the parallel caches. In this paper we introduce a new approach where asymptotic performance is maximized by using special low-overhead kernel primitives that are auto-generated by the ATLAS framework, while unnecessary cache motion is minimized by using explicit cache management. We show that this technique can outperform all known libraries at all problem sizes on commodity parallel Intel and AMD platforms, with asymptotic LU performance of roughly 91% of hardware theoretical peak for a 12-core Intel Xeon, and 87% for a 32-core AMD Opteron.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_________::08ba58247f2604357a5dfdd13b0728fa https://doi.org/10.1109/ipdps.2014.109
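
Note:	For readers unfamiliar with the terminology in the description, the following is a minimal, serial sketch of a generic blocked (right-looking) LU factorization with partial pivoting. It is not the authors' implementation and shows none of the ATLAS-generated kernels or explicit cache management the paper describes; the function name `lu_blocked` and the panel-width parameter `nb` are hypothetical names chosen for illustration. It only shows how limiting the panel width splits the work into a panel factorization, a triangular solve, and a trailing-matrix update, which is the structure that tiled and blocked libraries parallelize.

```python
def lu_blocked(a, nb=2):
    """Factor the square matrix `a` (list of lists of floats) in place.

    On return, the strict lower triangle of `a` holds L (unit diagonal
    implied), the upper triangle holds U, and the returned list `piv`
    satisfies: row i of L*U equals row piv[i] of the original matrix.
    `nb` is an illustrative panel width, not a tuned value.
    """
    n = len(a)
    piv = list(range(n))
    for k in range(0, n, nb):
        kb = min(nb, n - k)          # width of the current panel
        end = k + kb                 # first column of the trailing matrix

        # Step 1: unblocked factorization of the panel (columns k..end-1).
        for j in range(k, end):
            # Partial pivoting: pick the largest entry in column j.
            p = max(range(j, n), key=lambda i: abs(a[i][j]))
            if a[p][j] == 0.0:
                raise ValueError("matrix is singular")
            if p != j:
                a[j], a[p] = a[p], a[j]          # swap whole rows
                piv[j], piv[p] = piv[p], piv[j]
            for i in range(j + 1, n):
                a[i][j] /= a[j][j]               # multiplier, stored as L
                for c in range(j + 1, end):      # update inside the panel only
                    a[i][c] -= a[i][j] * a[j][c]

        # Step 2: triangular solve, U12 = L11^{-1} * A12 (forward substitution).
        for j in range(k, end):
            for i in range(j + 1, end):
                for c in range(end, n):
                    a[i][c] -= a[i][j] * a[j][c]

        # Step 3: trailing update, A22 -= L21 * U12 (the matrix-multiply step).
        for i in range(end, n):
            for j in range(k, end):
                lij = a[i][j]
                for c in range(end, n):
                    a[i][c] -= lij * a[j][c]
    return piv
```

Calling `piv = lu_blocked(m, nb)` on a copy of a nonsingular matrix overwrites the copy with the packed L and U factors. In an optimized library, step 3 dominates the run time for large problems and is typically delegated to a tuned matrix-multiply kernel, so the choice of `nb` trades kernel efficiency against available parallelism.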