Popis: |
Accelerators, like GPUs, have become a trend to deliver future performance desire, and sharing the same virtual memory space between CPUs and GPUs is increasingly adopted to simplify programming. However, address translation, which is the key factor of virtual memory, is becoming the bottleneck of performance for GPUs. In GPUs, a single TLB miss can stall hundreds of threads due to the SIMT execute model, degrading performance dramatically. Through real system analysis, we observe that the OS shows an advanced contiguity (e.g., hundreds of contiguous pages), and more large memory regions with advanced contiguity tend to be allocated with the increase of working sets. Leveraging the observation, we propose MESC to improve the translation efficiency for GPUs. The key idea of MESC is to divide each large page frame (2MB size) in virtual memory space into memory subregions with fixed size (i.e., 64 4KB pages), and store the contiguity information of subregions and large page frames in L2PTEs. With MESC, address translations of up to 512 pages can be coalesced into single TLB entry, without the needs of changing memory allocation policy (i.e., demand paging) and the support of large pages. In the experimental results, MESC achieves 77.2% performance improvement and 76.4% reduction in dynamic translation energy for translation-sensitive workloads. |