ArrayFire versus raw CUDA programming?

Tags:

cuda

gpu

arrayfire

I am quite new to GPU programming, but since I have a computationally intensive task I have turned to the GPU for possible performance gains.

I tried rewriting my program with the free version of ArrayFire. It is indeed faster than my multi-threaded CPU routine, but not by as much as I expected (the speedup is under 100%), and the returned results are not quite right (under 1% error relative to the CPU routine, assuming the CPU routine's results are correct).

My task is mainly element-wise float32 math on large matrices (300MB-500MB in size), with few if-thens, switch-cases, etc. I guess the performance bottleneck is likely the bandwidth between CPU and GPU memory, since there is a lot of data reading. The GPU I tested is a GeForce GTX 580 with 3GB of video memory.

Is there still significant room for optimization if I write raw CUDA code (with CUBLAS etc. and average optimization) instead of using ArrayFire for my task? I read some NVIDIA optimization guides; it seems that there are some memory-access tricks for faster data access and fewer bank conflicts. Does ArrayFire use these general tricks automatically or not?

asked Sep 29 '12 21:09

brbs2002

1 Answer

Thanks for the post. Glad to hear initial results were giving some speedup. I work on ArrayFire and can chime in here on your questions.

First and foremost, code is really required here for anyone to help with specifics. Can you share the code you wrote?

Second, you should think about CUDA and ArrayFire in the following way: CUDA is a way to program the GPU that lets you write any GPU code you want. But there is a huge difference between naive CUDA code (often slower than the CPU) and expert, painstaking, hand-optimized CUDA code. ArrayFire (and some other GPU libraries like CUBLAS) have many man-years of optimization poured into them, and will typically give better results than most people have time to achieve on their own. However, there is also variability in how well someone uses ArrayFire (or other libraries). There are variables that can and should be tweaked in ArrayFire library calls to get the best performance. If you post your code, we can share some of those here.

Third, ArrayFire uses CUBLAS in the functions that rely on BLAS, so you're not likely to see much difference using CUBLAS directly.

Fourth, yes, ArrayFire uses the optimizations described in the NVIDIA CUDA Programming Guide (e.g. faster data transfers and fewer memory bank conflicts, as you mention). That's where the bulk of ArrayFire development is focused: optimizing those sorts of things.

Finally, the data discrepancies you noticed are likely due to the nature of CPU vs GPU computing. Since they are different devices, you will often see slightly different results. It's not that the CPU gives better results than the GPU, but rather that both work with finite precision in slightly different ways. If you're using single precision rather than double, consider whether that accounts for the difference. Posting code will let us help on that too.

Happy to expand my answer once code is posted.

answered Dec 31 '22 22:12

arrayfire
