Optimized Batched Linear Algebra for Modern Architectures

Author: Jack Dongarra, Nicholas J. Higham, Samuel D. Relton, Sven Hammarling, Mawussi Zounon
Contributors: Rivera, Francisco F., Pena, Tomas F., Cabaleiro, Jose C.
Year of publication: 2017
Subject:
Source: Euro-Par 2017: Parallel Processing (Lecture Notes in Computer Science, vol. 10417)
Dongarra, J., Hammarling, S., Higham, N., Relton, S. & Zounon, M. 2017, 'Optimized Batched Linear Algebra for Modern Architectures', in F. F. Rivera, T. F. Pena & J. C. Cabaleiro (eds), Euro-Par 2017: Parallel Processing: 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28-September 1, 2017, Proceedings, Lecture Notes in Computer Science, vol. 10417, Springer Nature, pp. 511-522. https://doi.org/10.1007/978-3-319-64203-1_37
ISBN: 9783319642024
ISSN: 0302-9743 (print), 1611-3349 (electronic)
DOI: 10.1007/978-3-319-64203-1_37
Description: Solving large numbers of small linear algebra problems simultaneously is becoming increasingly important in many application areas. Whilst many researchers have investigated the design of efficient batch linear algebra kernels for GPU architectures, the common approach for many/multi-core CPUs is to use one core per subproblem in the batch. When solving batches of very small matrices, \(2\times 2\) for example, this design exhibits two main issues: because the matrices are too small, it fails to fully utilize either the vector units or the cache of modern architectures. Our approach to resolve this is as follows: given a batch of small matrices spread throughout the main memory, we first reorganize the elements of the matrices into a contiguous array, using a block interleaved memory format, which allows us to process the small independent problems as a single large matrix problem and enables cross-matrix vectorization. The large problem is solved using blocking strategies that attempt to optimize the use of the cache. The solution is then converted back to the original storage format. To explain our approach we focus on two BLAS routines: general matrix-matrix multiplication (GEMM) and the triangular solve (TRSM). We extend this idea to LAPACK routines using the Cholesky factorization and solve (POSV). Our focus is primarily on very small matrices ranging in size from \(2 \times 2\) to \(32 \times 32\). Compared to both MKL and OpenMP implementations, our approach can be up to 4 times faster for GEMM, up to 14 times faster for TRSM, and up to 40 times faster for POSV on the Intel Xeon Phi processor, code-named Knights Landing (KNL). Furthermore, we discuss strategies to avoid data movement between sockets when using our interleaved approach on a NUMA node.
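The following is a minimal illustrative sketch (not the authors' code) of the block interleaved memory format and cross-matrix vectorization described in the abstract: a batch of tiny matrices is repacked so that element (i,j) of consecutive matrices in a block sits contiguously, letting the innermost loop of a batched GEMM run across matrices with unit stride. The sizes N, BATCH, BLK and all function names are assumptions made for this example; the actual kernels additionally apply cache blocking and convert the result back to the original storage format.

```c
/* Illustrative sketch only (not the authors' implementation): block-interleaved
 * storage for a batch of tiny n-by-n matrices, plus a batched GEMM whose
 * innermost loop runs across matrices so the compiler can vectorize it.
 * N, BATCH, BLK and all names below are assumptions made for this example. */
#include <stdio.h>
#include <stdlib.h>

#define N     4      /* matrix dimension (very small, e.g. 2..32)      */
#define BATCH 1024   /* number of independent problems                 */
#define BLK   8      /* interleave block: matrices packed side by side */

/* Copy BATCH separate column-major matrices into block-interleaved form:
 * within each block of BLK matrices, element (i,j) of consecutive matrices
 * is stored contiguously, so a loop over the block is unit stride. */
static void interleave(const double *a, double *ai)
{
    for (size_t blk = 0; blk < BATCH / BLK; ++blk)
        for (size_t j = 0; j < N; ++j)
            for (size_t i = 0; i < N; ++i)
                for (size_t b = 0; b < BLK; ++b) {
                    size_t m = blk * BLK + b;                 /* matrix index */
                    ai[((blk * N + j) * N + i) * BLK + b] =
                        a[m * N * N + j * N + i];
                }
}

/* Batched C = A * B on the interleaved layout: the b-loop is innermost and
 * unit stride, so it maps naturally onto the CPU's vector units. */
static void batched_gemm_interleaved(const double *ai, const double *bi, double *ci)
{
    for (size_t blk = 0; blk < BATCH / BLK; ++blk) {
        size_t base = blk * N * N * BLK;  /* start of this block of matrices */
        for (size_t j = 0; j < N; ++j)
            for (size_t i = 0; i < N; ++i)
                for (size_t k = 0; k < N; ++k)
                    for (size_t b = 0; b < BLK; ++b)
                        ci[base + (j * N + i) * BLK + b] +=
                            ai[base + (k * N + i) * BLK + b] *
                            bi[base + (j * N + k) * BLK + b];
    }
}

int main(void)
{
    size_t total = (size_t)BATCH * N * N;
    double *a  = calloc(total, sizeof *a),  *b  = calloc(total, sizeof *b);
    double *ai = calloc(total, sizeof *ai), *bi = calloc(total, sizeof *bi);
    double *ci = calloc(total, sizeof *ci);

    for (size_t x = 0; x < total; ++x) { a[x] = 1.0; b[x] = 2.0; }

    interleave(a, ai);
    interleave(b, bi);
    batched_gemm_interleaved(ai, bi, ci);

    /* Each entry of every C should equal N * 1.0 * 2.0 = 8 for N = 4. */
    printf("C[0] = %g\n", ci[0]);
    free(a); free(b); free(ai); free(bi); free(ci);
    return 0;
}
```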
Database: OpenAIRE