Boosting Store Buffer Efficiency with Store-Prefetch Bursts

Authors:	Juan M. Cebrian, Alberto Ros, Stefanos Kaxiras
Contributors:	Facultades, Departamentos, Servicios y Escuelas::Facultades de la UMU::Facultad de Informática
Publication year:	2020
Subject:	010302 applied physics; Instruction prefetch; Boosting (machine learning); Hardware_MEMORYSTRUCTURES; Computer science; 02 engineering and technology; Commit; computer.software_genre; 01 natural sciences; Buffer (optical fiber); 020202 computer hardware & architecture; Datorsystem; Computer Systems; 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Operating system; Cache; Latency (engineering); computer
Source:	2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO); DIGITUM: Depósito Digital Institucional de la Universidad de Murcia
Description:	Virtually all processors today employ a store buffer (SB) to hide store latency. However, when the store buffer is full, store latency is exposed to the processor, causing pipeline stalls. The default strategies to mitigate these stalls are to issue prefetch-for-ownership requests when store instructions commit and to continuously increase the store buffer size. While these strategies considerably increase memory-level parallelism for stores, there are still applications that suffer deeply from stalls caused by the store buffer. Even worse, store-buffer-induced stalls increase considerably when simultaneous multi-threading is enabled, as the store buffer is statically partitioned among the threads. In this paper, we propose a highly selective and very aggressive prefetching strategy to minimize store-buffer-induced stalls. Our proposal, Store-Prefetch Burst (SPB), is based on the following insights: i) the majority of store-buffer-induced stalls are caused by a few stores; ii) the access pattern of such stores is easily predictable; and iii) the latency of the stores is not commonly hidden by standard cache prefetchers, as hiding their latency would require tremendous prefetch aggressiveness. SPB accurately detects contiguous store-access patterns (requiring just 67 bits of storage) and prefetches the remaining memory blocks of the accessed page in a single burst request to the L1 controller. SPB matches the performance of a 1024-entry SB implementation on a 56-entry SB (i.e., Skylake architecture). For a 14-entry SB (e.g., running four logical cores), it achieves 95.0% of that ideal performance, on average, for SPEC CPU 2017. Additionally, a 20-entry store buffer that incorporates SPB achieves the average performance of a standard 56-entry store buffer.
Database:	OpenAIRE
External link:	https://explore.openaire.eu/search/publication?articleId=doi_dedup___::a90ab88d129aa220ef7986383d9ad86b http://hdl.handle.net/10201/106144
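
The description says SPB detects contiguous store-access patterns and then prefetches the remaining blocks of the accessed page in one burst. The sketch below is a toy software model of that idea only; it does not reproduce the paper's 67-bit hardware design. The class name `StoreBurstDetector`, the 64-byte block size, the 4 KiB page size, and the trigger threshold of 4 are all assumptions made for illustration.

```python
BLOCK_SIZE = 64    # bytes per cache block (assumed, not stated in the record)
PAGE_SIZE = 4096   # bytes per page (assumed, not stated in the record)
THRESHOLD = 4      # hypothetical number of contiguous block steps before a burst


class StoreBurstDetector:
    """Toy model of contiguous-store detection; not the paper's actual design."""

    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self.last_block = None  # block number of the most recent store
        self.run_length = 0     # consecutive "next block" steps seen so far

    def observe_store(self, addr):
        """Record a committed store and return block addresses to prefetch."""
        block = addr // BLOCK_SIZE
        if self.last_block is not None and block == self.last_block:
            return []  # another store to the same block adds no information
        if self.last_block is not None and block == self.last_block + 1:
            self.run_length += 1  # the contiguous pattern continues
        else:
            self.run_length = 0   # pattern broken (or first store): start over
        self.last_block = block
        if self.run_length != self.threshold:
            return []  # fire only once, at the moment the threshold is reached
        # Burst: every remaining block from the next one to the end of the page.
        blocks_per_page = PAGE_SIZE // BLOCK_SIZE
        page_end_block = (block // blocks_per_page + 1) * blocks_per_page
        return [b * BLOCK_SIZE for b in range(block + 1, page_end_block)]
```

In this model, stores to addresses 0, 64, 128, 192, and 256 trigger a single burst on the fifth store, covering the rest of that 4 KiB page. Firing once per contiguous run is a simplification chosen here, not something the record specifies.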