Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

ArrayFire versus raw CUDA programming?

I am quite new to GPU programming, but since I have a computationally intensive task I have turned to the GPU for possible performance gains.

I tried rewriting my program with ArrayFire Free version. It is indeed faster than my CPU routine with multi-threading enabled, but not to the degree I expected (that is, < 100% speedup), and the returned results are not quite right (< 1% error compared to CPU routine, assuming the CPU routine's results are correct).

My task is mainly element-wise float-32 maths operations on large matrices (300MB-500MB size), with little if-thens/switch-cases etc. I guess the performance bottleneck is likely the bandwidth between CPU and GPU memory since there is a lot of data-reading, etc. The GPU I tested is a GeForce 580GTX with 3GB of video memory.

Is there still some significant room for optimization if I write raw CUDA code (with CUBLAS etc. and average optimization) instead of using ArrayFire for my task? I read some NVIDIA optimization guides; it seems that there is some memory-access tricks there for faster data-access and reducing bank-conflicts. Does ArrayFire use these general tricks automatically or not?

like image 846
brbs2002 Avatar asked Sep 29 '12 21:09

brbs2002


People also ask

What does ArrayFire do?

ArrayFire is a comprehensive, open source function library with interfaces for C, C++, Java, R and Fortran. It integrates with any CUDA application, and contains an array-based API for easy programmability.

Is CUDA a programming model?

This post outlines the main concepts of the CUDA programming model by outlining how they are exposed in general-purpose programming languages like C/C++. Let me introduce two keywords widely used in CUDA programming model: host and device. The host is the CPU available in the system.

Is CUDA faster?

It is interesting to note that it is faster to perform the CPU task for small matrixes. Where for larger arrays, the CUDA outperforms the CPU by large margins. On a large scale, it looks like the CUDA times are not increasing, but if we only plot the CUDA times, we can see that it also increases linearly.


1 Answers

Thanks for the post. Glad to hear initial results were giving some speedup. I work on ArrayFire and can chime in here on your questions.

First and foremost, code is really required here for anyone to help with specificity. Can you share the code you wrote?

Second, you should think about CUDA and ArrayFire in the following way: CUDA is a way to program the GPU that provides you with the ability to write any GPU code you want. But there is a huge difference between naive CUDA code (often slower than the CPU) and expert, time-staking, hand-optimized CUDA code. ArrayFire (and some other GPU libraries like CUBLAS) have many man-years of optimizations poured into them, and are typically going to give better results than most normal people will have time to achieve on their own. However, there is also variability in how well someone uses ArrayFire (or other libraries). There are variables that can and should be tweaked in the usage of ArrayFire library calls to get the best performance. If you post your code, we can help share some of those here.

Third, ArrayFire uses CUBLAS in the functions that rely on BLAS, so you're not likely to see much difference using CUBLAS directly.

Fourth, yes, ArrayFire uses all the optimizations that are available in the NVIDIA CUDA Programming Guide for (e.g. faster data-transfer and reducing memory bank conflicts like you mention). That's where the bulk of ArrayFire development is focused, on optimizing those sorts of things.

Finally, the data discrepancies you noticed are likely due to that nature of CPU vs GPU computing. Since they are different devices, you will often see slightly different results. It's not that the CPU gives better results than the GPU, but rather that they are both working with finite amounts of precision in slightly different ways. If you're using single-precision instead of double, you might consider that. Posting code will let us help on that too.

Happy to expand my answer once code is posted.

like image 132
arrayfire Avatar answered Dec 31 '22 22:12

arrayfire