Basic GPU application, integer calculations

Tags: c, gpgpu, gpu

Long story short: I have built several prototypes of interactive software. I currently use pygame (a Python SDL wrapper) and everything is done on the CPU. I am starting to port it to C and, at the same time, searching for existing ways to use some GPU power to offload redundant operations from the CPU. However, I cannot find a good "guideline" for which exact technology/tools to pick in my situation. I have read a plethora of docs, and it drains my mental powers very fast. I am not even sure whether it is possible at all, so I'm puzzled.
Here is a very rough sketch of my typical application skeleton, but assuming it now uses the GPU (note that I have almost zero practical knowledge of GPU programming). What matters is that data types and functionality must be preserved exactly. Here it is:
[Sketch: an array A and a lookup table R are passed with parameters P to a function F(A, R, P) on the GPU, which generates rectangles that are drawn to the screen]

So F(A, R, P) is some custom function, for example element substitution, repetition, etc. The function is presumably constant over the program's lifetime, and the rectangles' shapes are generally not equal to A's shape, so it is not an in-place calculation: the rectangles are simply generated by my functions. Examples of F: repeat rows and columns of A; substitute values with values from substitution tables; compose some tiles into a single array; any math function on A's values; etc. As said, all of this can easily be done on the CPU, but the app must be really smooth. BTW, in pure Python it became simply unusable after I added several visual features based on numpy arrays. Cython helps to write fast custom functions, but then the source code is already kind of a salad.

Question:

  • Does this schema reflect some (standard) technology / dev tools?

  • Is CUDA what I am looking for? If yes, some links/examples that coincide with my application structure would be great.

I realise this is a big question, so I will give more details if it helps.


Update

Here is a concrete example of two typical calculations for my prototype of a bitmap editor. The editor works with indices, and the data consists of layers with corresponding bit masks. Masks are the same size as their layers and, say, all layers are the same size (1024² pixels = 4 MB for 32-bit values). And my palette is, say, 1024 elements (4 KB for a 32 bpp format).
Consider I want to do two things now:

Step 1. I want to flatten all layers into one. Say A1 is the default layer (background) and layers 'A2' and 'A3' have masks 'm2' and 'm3'. In Python I'd write:

from numpy import logical_not
...
# keep A2 where m2 is set and A1 elsewhere, then overlay A3 where m3 is set
Result = (A1 * logical_not(m2) + A2 * m2) * logical_not(m3) + A3 * m3

Since the data are independent, I believe this must give a speedup proportional to the number of parallel blocks.
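For reference, here is a minimal CUDA sketch of how this step could look (the kernel name and launch configuration are hypothetical, and the masks are assumed to be stored as 0/1 integers, as in the numpy version above):

// Hypothetical element-wise flatten kernel: one thread per pixel,
// no dependencies between elements.
__global__ void flatten_kernel(const unsigned int *A1,
                               const unsigned int *A2,
                               const unsigned int *A3,
                               const unsigned int *m2,
                               const unsigned int *m3,
                               unsigned int *result,
                               int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Same formula as the numpy expression, per pixel:
        // (A1 * !m2 + A2 * m2) * !m3 + A3 * m3
        unsigned int v = A1[i] * (1u - m2[i]) + A2[i] * m2[i];
        result[i] = v * (1u - m3[i]) + A3[i] * m3[i];
    }
}

// Launch for 1024*1024 pixels, e.g.:
// int n = 1024 * 1024;
// flatten_kernel<<<(n + 255) / 256, 256>>>(dA1, dA2, dA3, dm2, dm3, dResult, n);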

Step 2. Now I have an array and want to 'colorize' it with some palette, so the palette will be my lookup table. As I see it now, there is a problem with simultaneous reads of a lookup-table element.
[Sketch: many threads reading the same palette element at the same time]

But my idea is that one could probably just duplicate the palette for all blocks, so each block can read its own palette. Like this:
[Sketch: the palette duplicated per block, so each block reads its own copy]
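To make that idea concrete, here is a hedged CUDA sketch (kernel name and block size are illustrative): each block first copies the 4 KB palette into its own shared memory, then every thread does its lookups against the per-block copy.

#define PALETTE_SIZE 1024  // 1024 entries * 4 bytes = 4 KB

__global__ void colorize_kernel(const unsigned int *indices,
                                const unsigned int *palette,
                                unsigned int *out,
                                int n)
{
    __shared__ unsigned int local_palette[PALETTE_SIZE];

    // Cooperatively stage the palette; with 256 threads per block,
    // each thread loads 4 entries.
    for (int p = threadIdx.x; p < PALETTE_SIZE; p += blockDim.x)
        local_palette[p] = palette[p];
    __syncthreads();  // make sure the copy is complete before any lookup

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = local_palette[indices[i]];
}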

asked May 20 '15 by Mikhail V


2 Answers

When your code is highly parallel (i.e. there are few or no data dependencies between stages of processing), you can go for CUDA (more fine-grained control over synchronization) or OpenCL (a very similar and portable OpenGL-like API for interfacing with the GPU for kernel processing). Most of the acceleration work we do happens in OpenCL, which has excellent interop with both OpenGL and DirectX, but we also have the same setup working with CUDA. One big difference between CUDA and OpenCL is that in CUDA you can compile kernels once and delay-load (and/or link) them in your app, whereas in OpenCL the compiler works together with the OpenCL driver stack to compile the kernel when the app starts.
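As an illustration of the CUDA side of that difference, a minimal driver-API sketch of delay-loading a kernel that was compiled offline might look like this (the file name "kernel.ptx" and the function name "my_kernel" are placeholders):

#include <cuda.h>

int main(void)
{
    CUmodule   module;
    CUfunction kernel;
    CUdevice   dev;
    CUcontext  ctx;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    // Load a kernel compiled offline (e.g. with: nvcc -ptx kernel.cu)
    cuModuleLoad(&module, "kernel.ptx");
    cuModuleGetFunction(&kernel, module, "my_kernel");

    // ... set up arguments and call cuLaunchKernel(kernel, ...) as usual.

    cuModuleUnload(module);
    cuCtxDestroy(ctx);
    return 0;
}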

One alternative that is often overlooked, if you're using Microsoft Visual Studio, is C++ AMP, a C++-syntax-friendly and intuitive API for those who do not want to dig into the logical twists and turns of the OpenCL/CUDA APIs. The big advantage here is that the code also works if you do not have a GPU in the system, although you then do not have as many options to tweak performance. Still, in a lot of cases this is a fast and efficient way to write proof-of-concept code and re-implement bits and parts in CUDA or OpenCL later.

OpenMP and Threading Building Blocks are only good alternatives when you have synchronization issues and lots of data dependencies. Native threading using worker threads is also a viable solution, but only if you have a good idea of how synchronization points can be set up between the different processes in such a way that threads do not starve each other out when fighting for priority. This is a lot harder to get right, and tools such as Parallel Studio are a must. But then, so is NVIDIA Nsight if you're writing GPU code.

Appendix:

A new platform called Quasar (http://quasar.ugent.be/blog/) is being developed that enables you to write your math problems in a syntax very similar to Matlab, with full support for C/C++/C# or Java integration, and cross-compiles (LLVM, Clang) your "kernel" code to any underlying hardware configuration. It generates CUDA PTX files, or runs on OpenCL, or even on your CPU using TBB, or a mixture of them. Using a few monikers, you can decorate the algorithm so that the underlying compiler can infer types (you can also use strict typing explicitly), so you can leave the type-heavy stuff entirely up to the compiler. To be fair, at the time of writing the system is still a work in progress and the first OpenCL-compiled programs are only just being tested, but the most important benefit is fast prototyping with almost identical performance compared to optimized CUDA.

answered by StarShine


What you want to do is send values to the GPU really fast using high-frequency dispatch, and then display the result of a function that is basically texture lookups and some parameters.

I would say this problem will only be worth solving on the GPU if two conditions are met:

  1. The size of A[] is optimised to make the transfer times irrelevant (see http://blog.theincredibleholk.org/blog/2012/11/29/a-look-at-gpu-memory-transfer/).

  2. The lookup table is not too big and/or the lookup values are organized in a way that maximizes cache utilization; in general, random lookups on the GPU can be slow. Ideally, you can pre-load the R[] values into a shared-memory buffer for each element of the A[] buffer.

If you can answer both of those questions positively, then and only then consider having a go at using the GPU for your problem; otherwise those two factors will overpower the computational speed-up that the GPU can provide you with.

Another thing you can look at is overlapping the transfer and compute times as much as possible, to hide the slow transfer rates of CPU->GPU data.
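As a hedged sketch of that overlap, using two CUDA streams and asynchronous copies (the chunk size, buffer names, and per-chunk kernel are illustrative):

#include <cuda_runtime.h>

// Placeholder for any per-element kernel, e.g. your F(A, R, P).
__global__ void process_chunk(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];  // identity, stands in for real work
}

void run(const unsigned int *h_A, unsigned int *d_A, unsigned int *d_out,
         int num_chunks, int chunk)
{
    cudaStream_t stream[2];
    cudaStreamCreate(&stream[0]);
    cudaStreamCreate(&stream[1]);

    for (int c = 0; c < num_chunks; ++c) {
        cudaStream_t s = stream[c % 2];
        // The async copy of chunk c can overlap with the kernel still
        // running on the other stream; h_A must be pinned memory
        // (allocated with cudaMallocHost) for the copy to be truly async.
        cudaMemcpyAsync(d_A + c * chunk, h_A + c * chunk,
                        chunk * sizeof(unsigned int),
                        cudaMemcpyHostToDevice, s);
        process_chunk<<<(chunk + 255) / 256, 256, 0, s>>>(
            d_A + c * chunk, d_out + c * chunk, chunk);
    }
    cudaDeviceSynchronize();
    cudaStreamDestroy(stream[0]);
    cudaStreamDestroy(stream[1]);
}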

Regarding your F(A, R, P) function, you need to make sure that you do not need to know the value of F(A, R, P)[0] in order to know what the value of F(A, R, P)[1] is, because if you do, then you need to rewrite F(A, R, P) to work around this issue using some parallelization technique. If you have a limited number of F() functions, then this can be solved by writing a parallel version of each F() function for the GPU to use; but if F() is user-defined, then your problem becomes a bit trickier.
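To sketch what that independence means (both snippets are hypothetical):

// Fine for the GPU: out[i] depends only on in[i] (element-wise).
__global__ void f_independent(const unsigned int *in, unsigned int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2u;
}

// Problematic as written: each out[i] depends on out[i-1] (a running
// sum), so this loop must be rewritten, e.g. as a parallel prefix scan:
// for (int i = 1; i < n; ++i) out[i] = out[i-1] + in[i];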

I hope this is enough information to have an informed guess towards whether you should or not use a GPU to solve your problem.

EDIT

Having read your edit, I would say yes. The palette could fit in shared memory (see "GPU shared memory size is very small - what can I do about it?"), which is very fast. If you have more than one palette, you could fit 16 KB (the size of shared memory on most cards) / 4 KB per palette = 4 palettes per block of threads.

One last warning: integer operations are not the fastest on the GPU. After you have implemented your algorithm and it is working, consider switching to floating point, if applicable, as a cheap optimization.

answered by Luis Yanes