GPGPU vs. Multicore?

1 Answer

Interesting question. I have researched this very problem so my answer is based on some references and personal experiences.

What types of problems are better suited to regular multicore and what types are better suited to GPGPU?

As @Jared mentioned, GPGPUs are built for very regular throughput workloads, e.g., graphics, dense matrix-matrix multiply, simple Photoshop filters, etc. They are good at tolerating long latencies because they are inherently designed to tolerate texture sampling, a 1000+ cycle operation. GPU cores have a lot of threads: when one thread issues a long-latency operation (say, a memory access), that thread is put to sleep until the operation finishes, while other threads continue to work. This lets GPUs keep their execution units busy far more of the time than traditional cores.
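A back-of-the-envelope model of this latency hiding, as a minimal sketch: it assumes each thread alternates a fixed amount of compute with a fixed stall, and that switching to another ready thread costs nothing. The function name and parameters are hypothetical, chosen for illustration, and are not part of any GPU toolkit.

```python
def core_utilization(num_threads: int, compute_cycles: int, latency_cycles: int) -> float:
    """Fraction of cycles the execution unit is busy under an idealized model.

    Assumptions (illustrative, not measured): every thread repeats
    `compute_cycles` of work followed by a `latency_cycles` stall, and
    switching between threads is free.
    """
    if num_threads < 1 or compute_cycles < 1 or latency_cycles < 0:
        raise ValueError("need num_threads >= 1, compute_cycles >= 1, latency_cycles >= 0")
    period = compute_cycles + latency_cycles  # one thread's work-then-stall cycle
    busy = num_threads * compute_cycles       # total work available during one period
    return min(1.0, busy / period)            # the unit cannot be more than 100% busy
```

Under these assumptions, one thread that computes for 10 cycles and then stalls for 990 keeps the unit busy about 1% of the time, while 100 such threads are enough to keep it fully busy.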

GPUs are bad at handling branches because they batch "threads" (SIMD lanes, if you are not nVidia) into warps and send them down the pipeline together to save on instruction fetch/decode power. If threads encounter a branch, they may diverge: e.g., 2 threads in an 8-thread warp may take the branch while the other 6 do not. Now the warp has to be split into two warps of size 2 and 6. If your core has 8 SIMD lanes (which is why the original warp packed 8 threads), the two newly formed warps will run inefficiently: the 2-thread warp at 25% efficiency and the 6-thread warp at 75% efficiency. You can imagine that if a GPU keeps encountering nested branches, its efficiency becomes very low. Therefore, GPUs aren't good at handling branches, and code dominated by divergent branches is a poor fit for them.
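The arithmetic in that example can be sketched as follows. This is a toy model, not how any particular GPU schedules work: it assumes the diverged groups run one after another on the full SIMD width and that every path costs the same number of instructions. The function names and path labels are hypothetical.

```python
from collections import Counter

def group_efficiencies(lane_paths, simd_width):
    """Map each path id to the fraction of SIMD lanes doing useful work on it."""
    if simd_width < 1 or not lane_paths or len(lane_paths) > simd_width:
        raise ValueError("need 1 <= len(lane_paths) <= simd_width")
    counts = Counter(lane_paths)  # how many threads ended up on each path
    return {path: n / simd_width for path, n in counts.items()}

def overall_efficiency(lane_paths, simd_width):
    """Useful lane-slots divided by issued lane-slots across all diverged groups.

    Assumes (for illustration) that each group is issued once at full width
    and that all paths are equally long.
    """
    groups = group_efficiencies(lane_paths, simd_width)
    return len(lane_paths) / (simd_width * len(groups))

# The 8-thread warp from the text: 2 threads take the branch, 6 do not.
warp = ["taken"] * 2 + ["not_taken"] * 6
per_group = group_efficiencies(warp, 8)  # 25% for the 2-thread group, 75% for the 6-thread group
combined = overall_efficiency(warp, 8)   # 8 useful lane-slots out of 16 issued
```

In the extreme case where nested branches send each of the 8 threads down its own path, the same model gives 1/8 of the lanes doing useful work.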

GPUs are also bad at cooperative threading. If threads need to talk to each other, GPUs won't work well because synchronization is not well supported on GPUs (but nVidia is on it).

Therefore, the worst code for a GPU is code with little parallelism, or code with lots of branches or synchronization.

What are the key differences in programming model?

GPUs don't support interrupts and exceptions. To me, that's the biggest difference. Other than that, CUDA is not very different from C. You can write a CUDA program where you ship code to the GPU and run it there. You access memory in CUDA a bit differently, but again that's not fundamental to our discussion.

What are the key underlying hardware differences that necessitate any differences in programming model?

I mentioned them already. The biggest is the SIMD nature of GPUs, which requires code to be written in a very regular fashion, without branches or inter-thread communication. This is part of why, e.g., CUDA restricts the number of nested branches in the code.

Which one is typically easier to use and by how much?

It depends on what you are coding and what your target is.

Easily vectorizable code (for example, an element-wise array operation): the CPU is easier to code for but gives lower performance; the GPU is slightly harder to code for but provides big bang for the buck. For everything else, the CPU is easier and often performs better as well.

Is it practical, in the long term, to implement high level parallelism libraries for the GPU, such as Microsoft's task parallel library or D's std.parallelism?

Task-parallelism, by definition, requires thread communication and has branches as well. The idea of tasks is that different threads do different things. GPUs are designed for lots of threads that are doing identical things. I would not build task parallelism libraries for GPUs.
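To make the distinction concrete, here is a small CPU-side sketch using Python's standard `concurrent.futures` module. It is not GPU code; it only contrasts the two styles. The helper names (`square`, `count_words`, `largest`, `data_parallel`, `task_parallel`) are hypothetical and exist only for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

def count_words(text):
    return len(text.split())

def largest(values):
    return max(values)

def data_parallel(values):
    """Data parallelism: every worker applies the same function to a different element."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        return list(pool.map(square, values))  # results come back in input order

def task_parallel(text, values):
    """Task parallelism: each worker runs a different function with its own control flow."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        words = pool.submit(count_words, text)  # one kind of work
        top = pool.submit(largest, values)      # an unrelated kind of work
        return words.result(), top.result()
```

The first style, identical work over many elements, is the shape that maps naturally onto GPU hardware; the second, with heterogeneous tasks, is the shape that task-parallel libraries are built around.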

If GPU computing is so spectacularly efficient, why aren't CPUs designed more like GPUs?

Lots of problems in the world are branchy and irregular. There are thousands of examples: graph search algorithms, operating systems, web browsers, etc. Also, even graphics is becoming more branchy and general-purpose with every generation, so GPUs will become more and more like CPUs. I am not saying they will become just like CPUs, but they will become more programmable. The right model is somewhere in between the power-inefficient CPUs and the very specialized GPUs.

189

answered Sep 28 '22 19:09

Aater Suleman



Tags:

performance

multithreading

parallel-processing

gpgpu

multicore

dsimcha
