I've written this CUDA kernel for Conway's game of life:
texture<float, 2, cudaReadModeElementType> inputTex;  // bound elsewhere to the CUDA array holding the current generation

__global__ void gameOfLife(float* returnBuffer, int width, int height)
{
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)   // guard threads of partial blocks at the grid edge
        return;

    // Current cell plus the sum of its eight neighbours
    // (out-of-range coordinates are clamped by the texture unit).
    float p = tex2D(inputTex, x, y);
    float neighbors = 0;
    neighbors += tex2D(inputTex, x+1, y);
    neighbors += tex2D(inputTex, x-1, y);
    neighbors += tex2D(inputTex, x, y+1);
    neighbors += tex2D(inputTex, x, y-1);
    neighbors += tex2D(inputTex, x+1, y+1);
    neighbors += tex2D(inputTex, x-1, y-1);
    neighbors += tex2D(inputTex, x-1, y+1);
    neighbors += tex2D(inputTex, x+1, y-1);

    // Die with <2 or >3 neighbours, survive with 2-3, birth on exactly 3.
    float final = 0;
    if(neighbors < 2) final = 0;
    else if(neighbors > 3) final = 0;
    else if(p != 0) final = 1;
    else if(neighbors == 3) final = 1;

    returnBuffer[x + y*width] = final;
}
I am looking for errors and possible optimizations. Parallel programming is quite new to me and I am not sure I am doing it right.
The rest of the code is a memcpy from an input array to the 2D texture inputTex, which is bound to a CUDA array. The output is memcpy-ed from global memory back to the host and dealt with there.
As you can see, each thread handles a single pixel. I am unsure whether that is the fastest way, as some sources suggest doing a row or more per thread. If I understand correctly, NVIDIA themselves say the more threads, the better. I would love advice on this from someone with practical experience.
My two cents.
The whole thing is likely to be bound by the latency of communication between the multiprocessors and GPU memory. You have code that should take something like 30-50 clock ticks to execute on its own, yet it generates at least 3 memory accesses which take 200+ clock ticks each if the requisite data is not in the cache — that is 600+ cycles spent waiting on memory for 30-50 cycles of actual work.
Using texture memory is a good way to address that, but it is not necessarily the optimal way.
At the very least, try to do 4 pixels at a time (horizontally) per thread. Global memory can be accessed 128 bytes at a time (as long as you have a warp trying to access any byte in a 128-byte interval, you might as well pull in the whole cache line at almost no additional cost). Since a warp is 32 threads, having each thread work on 4 pixels should be efficient.
Furthermore, you want to have vertically-adjacent pixels worked on by the same multiprocessor. The reason is that adjacent rows share the same input data. If you have the pixel (x=0,y=0) worked on by one MP and the pixel (x=0,y=1) is worked on by a different MP, both MPs have to issue three global memory requests each. If they are both worked on by the same MP and the results are properly cached (implicitly or explicitly), you only need a total of four. This can be done by having each thread work on several vertical pixels, or by having blockDim.y>1.
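Roughly what I have in mind (just a sketch; lifeStep, CELLS_PER_THREAD and the byte-sized cells are my own choices — see the note on the buffer type below; border cells are simply skipped to keep it short):

    #define CELLS_PER_THREAD 4

    // Each thread handles 4 horizontally adjacent cells; blockDim.y > 1 keeps
    // vertically adjacent rows on the same MP so shared input rows stay in cache.
    __global__ void lifeStep(const unsigned char* in, unsigned char* out,
                             int width, int height)
    {
        int x0 = (blockIdx.x * blockDim.x + threadIdx.x) * CELLS_PER_THREAD;
        int y  =  blockIdx.y * blockDim.y + threadIdx.y;
        if (y < 1 || y >= height - 1) return;          // skip border rows for brevity

        for (int i = 0; i < CELLS_PER_THREAD; ++i) {
            int x = x0 + i;
            if (x < 1 || x >= width - 1) continue;     // skip border columns for brevity
            int idx = y * width + x;
            int n = in[idx - width - 1] + in[idx - width] + in[idx - width + 1]
                  + in[idx - 1]                           + in[idx + 1]
                  + in[idx + width - 1] + in[idx + width] + in[idx + width + 1];
            out[idx] = (n == 3 || (n == 2 && in[idx])) ? 1 : 0;
        }
    }

Launched with, say, dim3 block(32, 8), each warp covers 128 consecutive byte-sized cells of a row and each block covers 8 adjacent rows, so the vertical reuse happens within one MP.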
More generally, you'd probably want to have each 32-thread warp load as much memory as you have available on the MP (16-48 KB, or at least a 128x128 pixel block), and then process all pixels within that window.
On devices of compute capability below 2.0, you'll want to use shared memory. On devices of compute capability 2.0 and 2.1, caching capabilities are much improved, so global memory may be fine.
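On those older devices the same idea looks roughly like this with an explicit shared-memory tile (again only a sketch, simplified back to one cell per thread; TILE_W/TILE_H and the clamped border handling, which mimics the texture unit's clamping, are my own assumptions):

    #define TILE_W 32
    #define TILE_H 8

    // Launch with dim3 block(TILE_W, TILE_H).
    __global__ void lifeStepShared(const unsigned char* in, unsigned char* out,
                                   int width, int height)
    {
        // Interior tile plus a one-cell halo on every side.
        __shared__ unsigned char tile[TILE_H + 2][TILE_W + 2];

        int x = blockIdx.x * TILE_W + threadIdx.x;   // this thread's global cell
        int y = blockIdx.y * TILE_H + threadIdx.y;

        // Cooperatively load the (TILE_W+2) x (TILE_H+2) window, clamping at the edges.
        for (int ty = threadIdx.y; ty < TILE_H + 2; ty += blockDim.y)
            for (int tx = threadIdx.x; tx < TILE_W + 2; tx += blockDim.x) {
                int gx = min(max((int)(blockIdx.x * TILE_W) + tx - 1, 0), width  - 1);
                int gy = min(max((int)(blockIdx.y * TILE_H) + ty - 1, 0), height - 1);
                tile[ty][tx] = in[gy * width + gx];
            }
        __syncthreads();

        if (x >= width || y >= height) return;

        int tx = threadIdx.x + 1, ty = threadIdx.y + 1;   // position inside the tile
        int n = tile[ty - 1][tx - 1] + tile[ty - 1][tx] + tile[ty - 1][tx + 1]
              + tile[ty    ][tx - 1]                    + tile[ty    ][tx + 1]
              + tile[ty + 1][tx - 1] + tile[ty + 1][tx] + tile[ty + 1][tx + 1];
        out[y * width + x] = (n == 3 || (n == 2 && tile[ty][tx])) ? 1 : 0;
    }

Each block reads its window from global memory once and every interior cell is then computed purely out of shared memory.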
Some nontrivial savings could be had by making sure that each warp only accesses two cache lines in each horizontal row of input pixels, instead of three, as would happen in a naive implementation that works on 4 pixels per thread, 32 threads per warp.
There's no good reason to use float as the buffer type. Not only do you end up using four times the memory bandwidth you need, but the code becomes unreliable and bug-prone. (For example, are you sure that if(neighbors == 3) works correctly, since you're comparing a float against an integer?) Use unsigned char. Better yet, use uint8_t and typedef it to mean unsigned char if it's not defined.
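Something along these lines (the nextState helper is just an illustration, not code from the question):

    #include <stdint.h>                 // provides uint8_t on any modern toolchain
    // typedef unsigned char uint8_t;   // fallback if <stdint.h> is unavailable

    // Integer cell states and neighbour counts: exact comparisons, and a quarter
    // of the memory traffic of float.
    __device__ uint8_t nextState(uint8_t alive, int neighbors)
    {
        return (neighbors == 3 || (neighbors == 2 && alive)) ? 1 : 0;
    }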
Finally, don't underestimate the value of experimenting. Quite often performance of CUDA code can't be easily explained by logic and you have to resort to tweaking parameters and seeing what happens.
TL;DR: see: http://golly.sourceforge.net
The problem is that most CUDA implementations follow the brain-dead idea of manually counting the neighbors. This is so slow that any smart serial CPU implementation will outperform it.
The only sensible way to do GoL calculations is using a lookup table.
The currently fastest CPU implementations look up a square 4x4 = 16-bit block to get the future 2x2 cells inside it.
in this setup the cells are laid out like so:
  01234567
0 xxxxxxxx   //byte 0
1 xxxxxxxx   //byte 1
2 etc
3
4
5
6
7
Some bit-shifting is employed to get a 4x4 block to fit into a word, and that word is looked up in a lookup table. The lookup table holds words as well; this way four different shifted versions of the outcome can be stored, so you can minimize the amount of bit-shifting that needs to be done on the input and/or the output.
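A rough CPU-side sketch of building and indexing such a table (simplified to a byte-valued table; the real thing stores the pre-shifted word variants described above):

    #include <stdint.h>

    static uint8_t lut[1 << 16];     // one entry per 4x4 input block; low 4 bits = future 2x2 core

    static int cell(uint32_t block, int row, int col)   // bit (row, col) of the 4x4 block
    {
        return (block >> (row * 4 + col)) & 1;
    }

    static void buildLut(void)
    {
        for (uint32_t block = 0; block < (1u << 16); ++block) {
            uint8_t result = 0;
            for (int row = 1; row <= 2; ++row)          // only the inner 2x2 has all 8 neighbours
                for (int col = 1; col <= 2; ++col) {
                    int n = 0;
                    for (int dr = -1; dr <= 1; ++dr)
                        for (int dc = -1; dc <= 1; ++dc)
                            if (dr || dc)
                                n += cell(block, row + dr, col + dc);
                    int next = (n == 3 || (n == 2 && cell(block, row, col)));
                    result |= (uint8_t)(next << ((row - 1) * 2 + (col - 1)));
                }
            lut[block] = result;
        }
    }
    // Advancing a 2x2 block then reduces to: shift the relevant bits of the
    // surrounding rows into a 16-bit index, one table read, one shift to place the output.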
In addition the different generations are staggered, so that you only have to look at 4 neighboring slabs, instead of 9. Like so:
AAAAAAAA
AAAAAAAA BBBBBBBB
AAAAAAAA BBBBBBBB
AAAAAAAA BBBBBBBB
AAAAAAAA BBBBBBBB
AAAAAAAA BBBBBBBB
AAAAAAAA BBBBBBBB
AAAAAAAA BBBBBBBB
         BBBBBBBB
//odd generations (A) are 1 pixel above and to the right of B,
//even generations (B) are 1 pixel below and to the left of A.
This alone results in a 1000x+ speed-up compared to silly counting implementations.
Then there is the optimization of not calculating slabs that are static or have a periodicity of 2.
And then there is HashLife, but that's a different beast altogether.
HashLife can generate Life patterns in O(log n) time, instead of the O(n) time that normal implementations need.
This allows you to calculate generation 6,366,548,773,467,669,985,195,496,000 (6 octillion) in mere seconds.
Unfortunately, HashLife requires recursion, and is thus difficult to implement on CUDA.