Preparing OpenCL programs for efficient execution on different architectures (Priprava programov OpenCL za učinkovito izvajanje na različnih arhitekturah)

Author: Šemrov, Jure
Contributors: Lotrič, Uroš
Language: Slovenian
Year of publication: 2017
Subject:
Description: This thesis focuses on the question of how to write OpenCL programs so that they execute efficiently on different architectures. The difficulty we face is the architectural differences between systems: to achieve maximum performance, the program must be adapted accordingly. The adaptations concern the number of compute units, the number of threads in a work-group, and the use of the vector unit, local memory, and caches, along with other techniques for hiding latency. In short, we must exploit any architectural advantages of the device as well as parallelism at both the instruction level and the thread level. The thesis examines five programs: histogram, matrix multiplication, prefix sum, the n-body problem, and bitonic sort. We adapt these programs to three different systems: the Intel Core i5-2450M CPU, the Xeon Phi 5110P manycore processor, and the Nvidia Tesla K20 GPU. To test the adaptations in practice, we measured program runtimes for different work-group sizes and tried to explain the observed behaviour. Generalizing our findings, the number of work-groups should be at least the number of compute units, and work-groups should be just large enough to reduce the overhead of work-group switching and to hide memory latency, while not so large that communication overhead grows or the number of work-groups executing concurrently on a compute unit drops. For efficient execution on the CPU and the manycore processor we must take the caches and the width of the vector unit into account, whereas on the GPU we must exploit the high memory throughput and hide latency with a large number of threads and with local memory. (Illustrative code sketches of these heuristics follow this record.)
Database: OpenAIRE
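
The sizing heuristic in the abstract (at least one work-group per compute unit, with the local size tuned to the hardware) can be sketched on the OpenCL host side. The following is a minimal illustration, not code from the thesis: it assumes a device and a built kernel already exist, the helper name pick_sizes, the problem size n, and the tuning factor are invented for this example, and error checking is omitted.

    #include <CL/cl.h>

    /* Derive launch sizes from device and kernel properties.
     * Hypothetical helper; error checking omitted for brevity. */
    static void pick_sizes(cl_device_id device, cl_kernel kernel,
                           size_t n, size_t *global, size_t *local)
    {
        cl_uint compute_units = 1;
        size_t preferred_multiple = 1;

        /* Compute units: cores on a CPU or Xeon Phi, SMX units on a Tesla K20. */
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(compute_units), &compute_units, NULL);

        /* Preferred work-group size multiple for this kernel:
         * typically the warp size on Nvidia GPUs, the vector width on CPUs. */
        clGetKernelWorkGroupInfo(kernel, device,
                                 CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                                 sizeof(preferred_multiple),
                                 &preferred_multiple, NULL);

        /* Heuristic from the abstract: local size a multiple of the preferred
         * value (the factor 4 is an arbitrary starting point for tuning),
         * and at least one work-group per compute unit. */
        *local = preferred_multiple * 4;
        size_t groups = (n + *local - 1) / *local;
        if (groups < compute_units)
            groups = compute_units;
        *global = groups * *local;   /* global size must be a multiple of local size */
    }

The abstract also names local memory as the main tool for hiding global-memory latency on the GPU. A generic local-memory histogram kernel (histogram is one of the five benchmark programs, though this sketch is not the thesis's implementation) shows the pattern: accumulate into fast per-group local memory, then merge once into global memory. It assumes 8-bit input data and OpenCL 1.1 atomics.

    __kernel void histogram256(__global const uchar *data,
                               __global uint *hist,   /* 256 bins, zeroed by the host */
                               const uint n)
    {
        __local uint local_hist[256];
        const size_t lid = get_local_id(0);
        const size_t lsize = get_local_size(0);

        /* Cooperatively zero the per-group histogram in local memory. */
        for (size_t i = lid; i < 256; i += lsize)
            local_hist[i] = 0;
        barrier(CLK_LOCAL_MEM_FENCE);

        /* Stride loop: each work-item handles many elements, so a large
         * thread count keeps the compute unit busy while loads are in flight. */
        for (size_t i = get_global_id(0); i < n; i += get_global_size(0))
            atomic_inc(&local_hist[data[i]]);
        barrier(CLK_LOCAL_MEM_FENCE);

        /* One pass of global atomics per work-group merges the partial result. */
        for (size_t i = lid; i < 256; i += lsize)
            atomic_add(&hist[i], local_hist[i]);
    }

On a CPU, where local memory is ordinary cached RAM, this kernel mainly benefits from reduced contention on the 256 shared bins; on a GPU such as the Tesla K20 it additionally keeps the hot counters in on-chip memory, which matches the CPU-versus-GPU distinction drawn in the abstract.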