Analysis of a Parallel/Distributed Application Using a Cycle-Accurate Parallel/Distributed Simulator
Author: | Omid Elahi, Mohammad Zaman Ataie |
---|---|
Publication year: | 2018 |
Subject: |
010302 applied physics; Multi-core processor; Out-of-order execution; CPU cache; Computer science; business.industry; Distributed computing; Cloud computing; 02 engineering and technology; 01 natural sciences; Bottleneck; 020202 computer hardware & architecture; Microarchitecture; 0103 physical sciences; 0202 electrical engineering, electronic engineering, information engineering; Single-core; business; SPECint |
Source: | Electrical Engineering (ICEE), Iranian Conference on. |
DOI: | 10.1109/icee.2018.8472447 |
Description: | Computer architects need a good understanding of the applications that run on the hardware they design, so that they can optimize their designs and run those applications more efficiently. Designing processors to accelerate big-data and cloud applications is an active research topic, yet only a few papers currently analyze or characterize emerging big-data and cloud applications. Although these studies reveal inefficiencies in processor microarchitectures running big-data applications, they were conducted on real hardware, which limits the scope and flexibility of the analysis. In this paper, we aim to characterize a big-data workload using a novel method for simulating a distributed system, and to optimize an out-of-order core for running cloud applications. dist-gem5 is a parallel and distributed version of gem5 that allows us to efficiently simulate a large-scale distributed system on a cluster. Using dist-gem5, we aim to identify the bottlenecks and inefficiencies in server processors and their overall system architecture, cutting across the off-chip network stack, operating systems, Ethernet devices, and core microarchitecture. First, we compare the results of a set of cloud applications against SPECint2006 results. Our results show that BigData workloads, compared to SPECint, have ~3x and ~4x higher instruction cache miss rate and branch misprediction rate, respectively. Next, we pick Memcached as a representative BigData workload and analyze how its performance and power scale with more cores under different request rates and core microarchitectures. Interestingly, we find that having more cores on a chip does not bring more performance, even for a parallel application like Memcached: a quad-core ARM-v7 chip can have up to 6.5x longer average request latency than a single-core ARM-v7 chip. We find that the L2 cache architecture is the bottleneck in the ARM-v7 multi-core system, and that fixing it can make the performance of an embedded core as good as that of a high-performance O3 core running a Memcached server. |
Database: | OpenAIRE |
External link: |