I performed element-wise multiplication with Torch (GPU-enabled) and with NumPy using the functions below, and found that NumPy runs faster than Torch, which I doubt should be the case.
I want to know how to perform general arithmetic operations with Torch using the GPU.
Note: I ran these code snippets in a Google Colab notebook.
Define the default tensor type to enable the global GPU flag:

import torch

torch.set_default_tensor_type(torch.cuda.FloatTensor
                              if torch.cuda.is_available()
                              else torch.FloatTensor)
Initialize the Torch variables:

x = torch.Tensor(200, 100)  # a FloatTensor
y = torch.Tensor(200, 100)
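(As an aside, torch.Tensor(200, 100) allocates uninitialized memory. For a comparison it may be clearer to fill the tensors with random values and place them on an explicit device instead of relying on the global default; a minimal sketch, where the variable name device is only illustrative:)

import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Random values instead of uninitialized memory, created directly on the chosen device
x = torch.rand(200, 100, device=device)
y = torch.rand(200, 100, device=device)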
The function in question:

def mul(d, f):
    g = torch.mul(d, f).cuda()  # I explicitly called cuda(), which is not necessary
    return g
When I call the function above as

%timeit mul(x, y)

it returns:

The slowest run took 10.22 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 50.1 µs per loop
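(Aside: CUDA kernel launches are asynchronous, so a measurement like the one above may mostly capture the launch overhead rather than the multiplication itself. A sketch of a synchronized variant, with mul_sync being a hypothetical helper name:)

def mul_sync(d, f):
    g = torch.mul(d, f)
    # Wait for all queued GPU work to finish so the timing reflects
    # the actual computation, not just the kernel launch
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return g

%timeit mul_sync(x, y)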
Now a trial with NumPy, using the same values as the Torch variables:

x_ = x.data.cpu().numpy()
y_ = y.data.cpu().numpy()
def mul_(d, f):
    g = d * f
    return g
%timeit mul_(x_, y_)
This returns:

The slowest run took 12.10 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 7.73 µs per loop
I need some help understanding GPU-enabled Torch operations.
It is nearly 15 times faster than NumPy for simple matrix multiplication!

Indexing into a PyTorch tensor is an order of magnitude slower than indexing into a NumPy array.

As you can see, for small operations NumPy performs better, and as the size increases tf-numpy provides better performance. And the performance on the GPU is far better than its CPU counterpart.

PyTorch tensors are similar to NumPy arrays, but they can also be operated on by a CUDA-capable NVIDIA GPU. NumPy arrays are mainly used in classical machine learning algorithms (such as k-means or decision trees in scikit-learn), whereas PyTorch tensors are mainly used in deep learning, which requires heavy matrix computation.
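For completeness, moving data between the two libraries is cheap on the CPU side but involves a device copy as soon as a GPU tensor is involved, roughly along these lines (a small sketch, not a benchmark; variable names are arbitrary):

import numpy as np
import torch

t = torch.rand(200, 100)       # CPU tensor
a = t.numpy()                  # shares memory with t, no copy
t2 = torch.from_numpy(a)       # also shares memory, no copy

if torch.cuda.is_available():
    t_gpu = t.to('cuda')       # host-to-device copy
    a2 = t_gpu.cpu().numpy()   # device-to-host copy before NumPy can use it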
The problem is that your GPU operation always has to put the input into GPU memory and then retrieve the results from there, which is quite a costly operation.
NumPy, on the other hand, directly processes the data from the CPU/main memory, so there is almost no delay here. Additionally, your matrices are extremely small, so even in the best-case scenario, there should only be a minute difference.
This is also partially the reason why you use mini-batches when training on a GPU in neural networks: Instead of having several extremely small operations, you now have "one big bulk" of numbers that you can process in parallel.
Also note that GPU clock speeds are generally way lower than CPU clocks, so the GPU only really shines because it has way more cores. If your matrix does not utilize all of them fully, you are also likely to see a faster result on your CPU.
TL;DR: If your matrix is big enough, you will eventually see a speed-up with CUDA over NumPy, even with the additional cost of the GPU transfer.
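A rough way to see this crossover yourself is to repeat the measurement at a few sizes, something like the sketch below (the sizes and the helper name bench are arbitrary choices, a CUDA GPU is assumed to be available, and the exact numbers will vary with hardware):

import time
import numpy as np
import torch

def bench(n, repeats=10):
    # Element-wise multiplication of n x n float32 matrices on the CPU (NumPy) and the GPU (Torch)
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    start = time.perf_counter()
    for _ in range(repeats):
        a * b
    cpu_time = (time.perf_counter() - start) / repeats

    d = torch.from_numpy(a).cuda()
    f = torch.from_numpy(b).cuda()
    torch.cuda.synchronize()   # make sure the transfers are done before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.mul(d, f)
    torch.cuda.synchronize()   # wait for the GPU to finish before stopping the clock
    gpu_time = (time.perf_counter() - start) / repeats

    print(f"n={n}: NumPy {cpu_time * 1e6:.1f} µs, Torch/CUDA {gpu_time * 1e6:.1f} µs")

for n in (200, 1000, 4000):
    bench(n)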