
pyCUDA vs C performance differences?

Tags: c, cuda, pycuda

I'm new to CUDA programming and I was wondering how the performance of pyCUDA compares to programs implemented in plain C. Will the performance be roughly the same? Are there any bottlenecks that I should be aware of?

EDIT: I obviously tried to google this issue first, and was surprised not to find any information; I would have expected the pyCUDA people to have this question answered in their FAQ.

asked Oct 28 '11 by memyself

5 Answers

If you're using CUDA -- whether directly through C or with pyCUDA -- all the heavy numerical work you're doing is done in kernels that execute on the GPU and are written in CUDA C (directly by you, or indirectly with elementwise kernels). So there should be no real difference in performance between those parts of your code.

Now, the initialization of arrays and any post-work analysis will be done in Python (probably with numpy) if you use pyCUDA, and that will generally be significantly slower than doing it directly in a compiled language (though if you've built your numpy/scipy so that it links directly to high-performance libraries, those calls at least would perform the same in either language). But hopefully your initialization and finalization are small fractions of the total amount of work you have to do, so that even if there is significant overhead there, it won't have a huge impact on the overall runtime.

And in fact, even if it turns out that the Python parts of the computation do hurt your application's performance, starting your development in pyCUDA may still be an excellent way to get going, as development is significantly easier, and you can always re-implement the parts of the code that are too slow in Python in straight C and call those from Python, gaining some of the best of both worlds.
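To illustrate the division of labor described above, here is a minimal sketch: the numerical kernel is plain CUDA C compiled by pyCUDA, while array setup and result checking stay in Python/numpy. The kernel and variable names here are illustrative, not taken from the answer.

    # Minimal sketch: the heavy work runs in a CUDA C kernel compiled by
    # pyCUDA; initialization and post-work analysis happen in numpy.
    import numpy as np
    import pycuda.autoinit              # creates a CUDA context
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void scale(float *x, float a, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            x[i] *= a;
    }
    """)
    scale = mod.get_function("scale")

    n = 1 << 20
    x = np.random.randn(n).astype(np.float32)   # initialization in numpy
    expected = 2.0 * x

    scale(drv.InOut(x), np.float32(2.0), np.int32(n),
          block=(256, 1, 1), grid=((n + 255) // 256, 1))

    assert np.allclose(x, expected)             # post-work analysis in numpy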

answered Nov 16 '22 by Jonathan Dursi

I've been using pyCUDA for a little while and I like prototyping with it because it speeds up the process of turning an idea into working code.

With pyCUDA you will be writing the CUDA kernels in C++, and it's CUDA, so there shouldn't be a difference in the performance of running that code. But there will be a difference in the performance of the code you write in Python to set up or use the results of the pyCUDA kernel versus the code you would write in C.

answered Nov 16 '22 by jkysam

If you're wondering about performance differences between the various ways of using pyCUDA, see SimpleSpeedTest.py, included in the pyCUDA Wiki examples. It benchmarks the same task completed by a CUDA C kernel encapsulated in pyCUDA and by several abstractions created by pyCUDA's designer. There is a performance difference.
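For a flavor of what that file measures, here is a rough sketch in its spirit (this is not the actual SimpleSpeedTest.py; names are illustrative): the same elementwise operation written once as a hand-coded CUDA C kernel and once with pyCUDA's ElementwiseKernel abstraction, timed with CUDA events.

    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    import pycuda.gpuarray as gpuarray
    from pycuda.compiler import SourceModule
    from pycuda.elementwise import ElementwiseKernel

    n = 1 << 22
    a = gpuarray.to_gpu(np.random.randn(n).astype(np.float32))

    # Hand-written CUDA C kernel wrapped by SourceModule.
    mod = SourceModule("""
    __global__ void twice(float *x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }
    """)
    twice_raw = mod.get_function("twice")

    # The same operation through pyCUDA's ElementwiseKernel abstraction.
    twice_elem = ElementwiseKernel("float *x", "x[i] *= 2.0f", "twice_elem")

    def time_gpu(fn):
        start, end = drv.Event(), drv.Event()
        start.record()
        fn()
        end.record()
        end.synchronize()
        return start.time_till(end)     # milliseconds

    t_raw = time_gpu(lambda: twice_raw(a.gpudata, np.int32(n),
                                       block=(256, 1, 1),
                                       grid=((n + 255) // 256, 1)))
    t_elem = time_gpu(lambda: twice_elem(a))
    print("raw kernel: %.3f ms, elementwise: %.3f ms" % (t_raw, t_elem))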

answered Nov 16 '22 by Peter Becich

If you're using PyCUDA and you want high performance, make sure you compile with -O3 optimizations and use nvprof/nvvp to profile your kernels. If you want to use CUDA from Python, PyCUDA is probably THE choice, because interfacing C++/CUDA code with Python is just hell otherwise: you have to write a hell of a lot of ugly wrappers, and for numpy integration even more hardcore wrapper code would be necessary.
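As a sketch of the first suggestion: SourceModule accepts an options list that is forwarded to the nvcc invocation, so you can request optimizations when pyCUDA compiles the kernel (how much -O3 changes the generated device code depends on your nvcc version).

    import pycuda.autoinit
    from pycuda.compiler import SourceModule

    # Forward optimization flags to nvcc when pyCUDA compiles the kernel.
    mod = SourceModule("""
    __global__ void noop() {}
    """, options=["-O3", "--use_fast_math"])

Profiling works the same as for any CUDA program: running, for example, nvprof python your_script.py (with your_script.py standing in for your own program) records the kernels launched from Python.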

answered Nov 16 '22 by Íhor Mé

I was looking for an answer to the original question in this post, and I see the problem is deeper than I thought.

In my experience, I compared CUDA kernels and cuFFT calls written in C with the same written in PyCUDA. Surprisingly, I found that, on my computer, the performance of summing, multiplying or computing FFTs varies between implementations. For example, I got almost the same performance in cuFFT for vector sizes up to 2^23 elements. However, summing and multiplying complex vectors showed some troubles. The speedup obtained in C/CUDA was ~6x for N=2^17, whilst in PyCUDA it was only ~3x. It also depends on the way the summation is performed. By using SourceModule and wrapping the raw CUDA code, I found that my kernel, for complex128 vectors, was limited to a lower N (<=2^16) than the one usable with gpuarray (<=2^24).

In conclusion, it is worthwhile to test and compare the two sides of the problem, and to evaluate whether it is convenient to spend time writing a CUDA script or to gain readability and pay the cost of lower performance.
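For reference, here is a small sketch of the gpuarray route mentioned above (names illustrative): gpuarray generates the elementwise complex-arithmetic kernel for you, which is the convenient path being compared against a hand-wrapped SourceModule kernel in the answer.

    import numpy as np
    import pycuda.autoinit
    import pycuda.gpuarray as gpuarray

    n = 1 << 17
    a_host = (np.random.randn(n) + 1j * np.random.randn(n)).astype(np.complex128)
    b_host = (np.random.randn(n) + 1j * np.random.randn(n)).astype(np.complex128)

    a = gpuarray.to_gpu(a_host)
    b = gpuarray.to_gpu(b_host)

    c = (a * b).get()                   # elementwise complex multiply on the GPU
    assert np.allclose(c, a_host * b_host)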

answered Nov 16 '22 by Alfredo Daniel Sánchez