Can OpenMP be used for GPUs?

Tags: multithreading, fortran, gpu, openmp, openacc

I've been searching the web but I'm still very confused about this topic. Can anyone explain this more clearly? I come from an Aerospace Engineering background (not from a Computer Science one), so when I read online about OpenMP/CUDA/etc. and multithreading I don't really understand a great deal of what is being said.

I'm currently trying to parallelize an in-house CFD software written in FORTRAN. These are my doubts:

1. OpenMP shares the workload across multiple CPU threads. Can it also be used to give the GPU some of the work?
2. I've read about OpenACC. Is it similar to OpenMP (easy to use)?

I've also read about CUDA and kernels, but I don't have much experience in parallel programming and I don't have the faintest idea of what a kernel is.

3. Is there an easy and portable way to share my workload with the GPU, for FORTRAN (if OpenMP doesn't do that and OpenACC is not portable)?

Can you give me a "for dummies" type of answer?

asked Mar 10 '15 11:03

André Almeida

3 Answers

The IBM-developed Clang/LLVM implementation of OpenMP 4+ for NVIDIA GPUs is available from https://github.com/clang-ykt. The build recipe is provided in "OpenMP compiler for CORAL/OpenPower Heterogeneous Systems".

The Cray compiler supports OpenMP target for NVIDIA GPUs. From the Cray Fortran Reference Manual (8.5):

The OpenMP 4.5 target directives are supported for targeting NVIDIA GPUs or the current CPU target. An appropriate accelerator target module must be loaded to use target directives.

The Intel compiler supports OpenMP target for Intel Gen graphics for C/C++ but not Fortran. Furthermore, the teams and distribute constructs are not supported there because they are not necessary/appropriate. Below is a simple example showing how the OpenMP target features work in different environments.

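/* Element-wise vector add, offloaded with OpenMP target directives. */
void vadd2(int n, float * a, float * b, float * c)
{
    /* Copy n, a and b to the device; copy c back when the region ends. */
    #pragma omp target map(to:n,a[0:n],b[0:n]) map(from:c[0:n])
#if defined(__INTEL_COMPILER) && defined(__INTEL_OFFLOAD)
    /* Intel Gen graphics offload: teams/distribute are not supported. */
    #pragma omp parallel for simd
#else
    #pragma omp teams distribute parallel for simd
#endif
    for(int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}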

The compiler options for Intel and GCC are as follows. I don't have GCC set up for NVIDIA GPUs, but you can see the GCC documentation for the appropriate -foffload options.

$ icc -std=c99 -qopenmp -qopenmp-offload=gfx -c vadd2.c && echo "SUCCESS" || echo "FAIL"
SUCCESS
$ gcc-7 -fopenmp -c vadd2.c && echo "SUCCESS" || echo "FAIL"
SUCCESS
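
A successful compile does not show where a target region actually runs: without a configured offload target, OpenMP normally falls back to the host. The following is a minimal sketch of a run-time check, assuming an OpenMP 4.0+ compiler. The helper names `target_ran_on_device`, `require_offload` and `report_offload` are made up for this illustration; `omp_is_initial_device()` is the standard OpenMP runtime routine.

```c
#include <assert.h>
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

/* Returns 1 if a target region executed on an accelerator,
   0 if it ran on the host (no device, or built without OpenMP). */
int target_ran_on_device(void)
{
    int on_host = 1;  /* stays 1 when the pragma below is compiled out */
#if defined(_OPENMP) && _OPENMP >= 201307  /* target needs OpenMP 4.0+ */
    #pragma omp target map(from:on_host)
    {
        /* Nonzero on the host, zero inside a real device region. */
        on_host = omp_is_initial_device() ? 1 : 0;
    }
#endif
    return on_host ? 0 : 1;
}

/* Hypothetical guard: abort instead of silently running on the CPU. */
void require_offload(void)
{
    assert(target_ran_on_device() && "target region fell back to the host");
}

/* Print a one-line report and return the same flag. */
int report_offload(void)
{
    int on_device = target_ran_on_device();
    printf("target region ran on the %s\n", on_device ? "device" : "host");
    return on_device;
}
```

Calling `report_offload()` once at program start is usually enough to tell whether the offload flags took effect. Built without any OpenMP flag, the sketch still compiles and simply reports the host.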

answered Oct 19 '22 23:10

Jeff Hammond

1. The OpenMP 4.0 standard includes support for accelerators (GPU, DSP, Xeon Phi, and so on), but I don't know of any existing implementation of the OpenMP 4.0 standard for GPUs, only early experiments.
2. OpenACC is indeed similar to OpenMP and easy to use. Good OpenACC tutorial: part 1 and part 2.
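
To make the similarity concrete, here is a minimal sketch of a vector add written with OpenACC directives. It is shown in C to keep one language for the examples; in Fortran the same directive is spelled `!$acc parallel loop`. The function name `vadd_acc` is made up for this illustration, and the compiler flag depends on your toolchain (for example `-fopenacc` with GCC or `-acc` with the NVIDIA HPC compilers). A compiler without OpenACC support ignores the pragma and runs the loop serially.

```c
#include <assert.h>

/* Element-wise vector add with OpenACC (hypothetical example function). */
void vadd_acc(int n, const float *a, const float *b, float *c)
{
    assert(n >= 0);
    /* copyin:  a and b go host -> device before the loop.
       copyout: c comes device -> host after the loop.
       parallel loop: the programmer asserts the iterations are independent. */
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```

As with OpenMP, the loop body is untouched; only the directive and its data clauses are added, which is why both are considered easy to adopt in an existing code.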

Unfortunately, I think there is no portable solution for both CPU and GPU, at least for now (except for OpenCL, but it is too low-level compared to OpenMP and OpenACC).

If you need a portable solution, you could consider using an Intel Xeon Phi accelerator instead of a GPU. The Intel Fortran (and C/C++) compiler includes OpenMP support for both the CPU and Xeon Phi.

In addition, to create a really portable solution, it is not enough to pick a suitable parallel technology. You have to restructure your program so that it exposes enough parallelism. See "Structured Parallel Programming" or similar books for examples of possible approaches.
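
As a rough illustration of what "exposing parallelism" can mean in a CFD-style code, consider a 1D smoothing sweep. This is only a sketch with made-up function names, not taken from any particular solver. The in-place version reads values updated earlier in the same sweep, so its iterations depend on each other. The Jacobi-style version reads only the old array, so every iteration is independent and a directive can safely split the loop. Note that the two variants are different numerical schemes and do not produce identical results.

```c
#include <assert.h>

/* In-place sweep: u[i-1] was already overwritten in this sweep, so
   iteration i depends on iteration i-1 (loop-carried dependence).
   Simply adding a parallel directive here would change the result. */
void smooth_in_place(int n, double *u)
{
    assert(n >= 2);
    for (int i = 1; i < n - 1; i++)
        u[i] = 0.5 * (u[i - 1] + u[i + 1]);
}

/* Jacobi-style sweep: reads only u_old and writes only u_new, so all
   iterations are independent and can run on separate threads. */
void smooth_jacobi(int n, const double *u_old, double *u_new)
{
    assert(n >= 2);
    u_new[0] = u_old[0];          /* boundary values are copied unchanged */
    u_new[n - 1] = u_old[n - 1];
    #pragma omp parallel for
    for (int i = 1; i < n - 1; i++)
        u_new[i] = 0.5 * (u_old[i - 1] + u_old[i + 1]);
}
```

The price of the second form is an extra array and, typically, more sweeps to converge, which is the kind of trade-off the restructuring involves.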

answered Oct 19 '22 22:10

Andrey Sozykin

To add to what was said about support on other platforms above: IBM is contributing to two OpenMP 4.5 compilers. One is the open-source Clang/LLVM compiler; the other is IBM's XL compiler. Both share the same OpenMP offloading helper library, but differ in code generation and optimization for the GPU. For Fortran, the XL Fortran compiler supports a large subset of OpenMP 4.5 offloading to NVIDIA GPUs starting in version 15.1.5 (and in version 13.1.5 for XL C/C++). More features are being added this year and next year, with the aim of complete support in 2018. If you're on POWER, you can join the XL compiler beta program to get access to our latest OpenMP offloading features in Fortran and C/C++.

answered Oct 19 '22 21:10

Rafik Zurob
