What is OpenACC?
The OpenACC Organization is dedicated to helping the research and developer community advance science by expanding their accelerated and parallel computing skills. They have 3 areas of focus: participating in computing ecosystem development, providing training and education on programming models, resources and tools, and developing the OpenACC specification.
OpenACC is a user-driven directive-based performance-portable parallel programming model. It is designed for scientists and engineers interested in porting their codes to a wide-variety of heterogeneous HPC hardware platforms and architectures with significantly less programming effort than required with a low-level model. The OpenACC specification supports C, C++, Fortran programming languages and multiple hardware architectures including X86 & POWER CPUs, NVIDIA GPUs, and Xeon KNL in the near future.
This three-step tutorial is designed to show you how to take advantage of compilers and libraries to quickly accelerate your codes with CPUs and GPUs so that you can spend more time on real breakthroughs.
This tutorial uses PGI OpenACC compiler for C, C++, Fortran, along with tools from the PGI Community Edition, but we encourage you to find compilers and tools that fit your project requirements among a variety of products offered by OpenACC members.
Analyze your code using profiling tools. Identify functions and loops that will run faster on GPUs. A generated baseline CPU profile shows where an executable is spending the most time. Check if some operations identified by the profiler have been already accelerated on GPUs through existing GPU libraries and then proceed with OpenACC directives.
Now you can begin exposing parallelism starting with the functions and loops that take the most time on a CPU. OpenACC compiler will run GPU parts of the code identified by directives or pragmas. Use #pragma acc parallel to initiate parallel execution, #pragma acc kernel and loop to execute a kernel or surrounding loops on a GPU.
Optimizing data movements can bring a significant performance increase. Use loop optimizations to achieve even faster results. Note that if you use a Pascal GPU, data movements will be performed by the GPU itself without a need to add additional directives.