I performed element-wise multiplication with Torch (GPU-enabled) and with NumPy using the functions below, and found that NumPy runs faster than Torch, which I doubt should be the case.
I want to know how to perform general arithmetic operations with Torch using the GPU.
Note: I ran these code snippets in a Google Colab notebook.
Define the default tensor type to enable the global GPU flag:

import torch

torch.set_default_tensor_type(torch.cuda.FloatTensor
                              if torch.cuda.is_available()
                              else torch.FloatTensor)
Initialize the Torch variables:

x = torch.Tensor(200, 100)  # a FloatTensor
y = torch.Tensor(200, 100)
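(As an aside, torch.Tensor(200, 100) allocates uninitialized memory. For a comparison it may be clearer to fill the tensors with random values and place them on an explicit device instead of relying on the global default; a minimal sketch, where the variable name device is only illustrative:)

import torch

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Random values instead of uninitialized memory, created directly on the chosen device
x = torch.rand(200, 100, device=device)
y = torch.rand(200, 100, device=device)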
The function in question:

def mul(d, f):
    g = torch.mul(d, f).cuda()  # I explicitly called cuda(), which is not necessary
    return g
When I call the function above as

%timeit mul(x, y)

it returns:

The slowest run took 10.22 times longer than the fastest. This could mean that an intermediate result is being cached. 10000 loops, best of 3: 50.1 µs per loop
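(Aside: CUDA kernel launches are asynchronous, so a measurement like the one above may mostly capture the launch overhead rather than the multiplication itself. A sketch of a synchronized variant, with mul_sync being a hypothetical helper name:)

def mul_sync(d, f):
    g = torch.mul(d, f)
    # Wait for all queued GPU work to finish so the timing reflects
    # the actual computation, not just the kernel launch
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return g

%timeit mul_sync(x, y)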
Now a trial with NumPy, using the same values as the Torch variables:

x_ = x.data.cpu().numpy()
y_ = y.data.cpu().numpy()
def mul_(d, f):
    g = d * f
    return g
%timeit mul_(x_, y_)
This returns:

The slowest run took 12.10 times longer than the fastest. This could mean that an intermediate result is being cached. 100000 loops, best of 3: 7.73 µs per loop
I need some help understanding GPU-enabled Torch operations.
It is nearly 15 times faster than NumPy for simple matrix multiplication!

Indexing into a PyTorch tensor is an order of magnitude slower than indexing into a NumPy array.

As you can see, for small operations NumPy performs better, and as the size increases tf-numpy provides better performance. And the performance on the GPU is far better than its CPU counterpart.

PyTorch tensors are similar to NumPy arrays, but they can also be operated on by a CUDA-capable NVIDIA GPU. NumPy arrays are mainly used in classical machine learning algorithms (such as k-means or decision trees in scikit-learn), whereas PyTorch tensors are mainly used in deep learning, which requires heavy matrix computation.
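For completeness, moving data between the two libraries is cheap on the CPU side but involves a device copy as soon as a GPU tensor is involved, roughly along these lines (a small sketch, not a benchmark; variable names are arbitrary):

import numpy as np
import torch

t = torch.rand(200, 100)       # CPU tensor
a = t.numpy()                  # shares memory with t, no copy
t2 = torch.from_numpy(a)       # also shares memory, no copy

if torch.cuda.is_available():
    t_gpu = t.to('cuda')       # host-to-device copy
    a2 = t_gpu.cpu().numpy()   # device-to-host copy before NumPy can use it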
The problem is that your GPU operation always has to put the input into GPU memory and then retrieve the results from there, which is quite a costly operation.
NumPy, on the other hand, directly processes the data from the CPU/main memory, so there is almost no delay here. Additionally, your matrices are extremely small, so even in the best-case scenario, there should only be a minute difference.
This is also partially the reason why you use mini-batches when training on a GPU in neural networks: Instead of having several extremely small operations, you now have "one big bulk" of numbers that you can process in parallel.
Also note that GPU clock speeds are generally way lower than CPU clocks, so the GPU only really shines because it has way more cores. If your matrix does not utilize all of them fully, you are also likely to see a faster result on your CPU.
TL;DR: If your matrix is big enough, you will eventually see a speed-up with CUDA over NumPy, even with the additional cost of the GPU transfer.
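A rough way to see this crossover yourself is to repeat the measurement at a few sizes, something like the sketch below (the sizes and the helper name bench are arbitrary choices, a CUDA GPU is assumed to be available, and the exact numbers will vary with hardware):

import time
import numpy as np
import torch

def bench(n, repeats=10):
    # Element-wise multiplication of n x n float32 matrices on the CPU (NumPy) and the GPU (Torch)
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)

    start = time.perf_counter()
    for _ in range(repeats):
        a * b
    cpu_time = (time.perf_counter() - start) / repeats

    d = torch.from_numpy(a).cuda()
    f = torch.from_numpy(b).cuda()
    torch.cuda.synchronize()   # make sure the transfers are done before timing
    start = time.perf_counter()
    for _ in range(repeats):
        torch.mul(d, f)
    torch.cuda.synchronize()   # wait for the GPU to finish before stopping the clock
    gpu_time = (time.perf_counter() - start) / repeats

    print(f"n={n}: NumPy {cpu_time * 1e6:.1f} µs, Torch/CUDA {gpu_time * 1e6:.1f} µs")

for n in (200, 1000, 4000):
    bench(n)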