We are looking for exemplar problems and codes that will run on any or all of shared memory, distributed memory, and GPGPU architectures. The reference platform we are using is LittleFe (littlefe.net), an open-design, low cost educational cluster currently with six dual core CPUs, each with an nVidia chipset.
These problems and solutions will be good for teaching parallelism to any newbie by providing working examples and opportunities to roll up your sleeves and code. Stackoverflow experts have good insight and are likely to have some favorites.
Calculating area under a curve is interesting, simple and easy to understand, but there are bound to be ones that are just as easily expressed and chock full of opportunities to practice and learn.
Hybrid examples using more than one of the memory architectures are most desirable, and reflective of where parallel programming seems to be trending.
On LittleFe we have predominantly been using three applications. The first is an analysis of optimal targets on a dartboard which is highly parallel with little communication overhead. The second is Conway's game of life which is a typical of problems sharing boundary conditions. It has a moderate communication overhead. The third is an n-body model of galaxy formation which requires heavy communication overhead.
The CUDA programming guide contains a detailed analysis of the implementation of matrix multiplication on a GPU. That seems to be the staple "hello world" example for learning GPU programing.
Furthermore, the CUDA SDK contains tens of other well explained examples of GPU programming in CUDA and OpenCL. My favorite is the colliding balls example. (a demo with a few thousands of balls colliding in real time)
The CUDA samples are not packaged with the Toolkit anymore. Instead you can find them on GitHub.
Two of my favorites are numerical integeration and finding prime numbers. For the first we code the midpoint rectangle rule on the function f(x) = 4.0 / (1.0 + x*x). Integration of the function between 0 and 1 give an approximation of the constant pi, which makes checking the correctness of the answer easy. The parallelism is across the range of the integration (computing the areas of rectangles).
For the second, we input an integer range and then identify and save the prime numbers in that range. We use a brute force division of values by all possible factors; if any divisors are found that are not 1 or the number, then the value is composite. If a prime is found, count it and store in a shared array. The parallelism is dividing up the range since testing for primality of N is independent of testing M. There is some trickiness needed to share the prime store between threads or to gather distributed parital answers.
These are very basic and simple problems to solve, which allows students to focus on the parallel implementation and not so much on the computation involved.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With