Here is an excerpt from a Jupyter notebook:
In [1]:
import torch, numpy as np, datetime
cuda = torch.device('cuda')
In [2]:
ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))
Wall time: 349 ms
tensor(17.0374, device='cuda:0') tensor(17.0376, device='cuda:0')
The time is low but still seems plausible (0.35 s for about 1e12 multiplications).
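As a back-of-the-envelope check of that claim (the GPU model is not given in the question, so this is only rough arithmetic):

```python
n = 10_000
mults = n ** 3                 # one multiply per output element per inner index: 1e12 in total
seconds = 0.349
print(f"{mults / seconds / 1e12:.1f}e12 multiplications per second")  # ~2.9e12
```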
But if we repeat the same:
ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))
Wall time: 999 µs
tensor(-78.7172, device='cuda:0') tensor(-78.7173, device='cuda:0')
1e12 multiplications in 1 ms?!
Why did the time drop from 349 ms to 1 ms?
Info:
For matrix multiplication in PyTorch, use torch.mm(). NumPy's np.dot() is more flexible by comparison: it computes the inner product for 1-D arrays and performs matrix multiplication for 2-D arrays.
To multiply two tensors element-wise, use torch.mul() and assign the result to a new variable. You can also multiply a tensor by a scalar. Multiplying tensors this way does not modify the original tensors.
Matrix multiplication is only possible if the number of columns in A equals the number of rows in B; the elements of each row of the first matrix are multiplied with the corresponding column of the second matrix.
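That terminology is compressed, so here is a small self-contained illustration of the difference (CPU tensors with arbitrary example shapes; none of this comes from the question itself):

```python
import torch
import numpy as np

a = torch.randn(3, 4)
b = torch.randn(4, 5)

# torch.mm / torch.matmul: true matrix multiplication, (3, 4) @ (4, 5) -> (3, 5)
print(torch.mm(a, b).shape)        # torch.Size([3, 5])
print(torch.matmul(a, b).shape)    # same result for 2-D inputs

# torch.mul: element-wise product (shapes must broadcast); it leaves its
# inputs untouched and also accepts a scalar
print(torch.mul(a, 2.0).shape)     # torch.Size([3, 4])

# np.dot: inner product for 1-D arrays, matrix multiplication for 2-D arrays
x = np.random.randn(4)
y = np.random.randn(4)
print(np.dot(x, y))                        # scalar inner product
print(np.dot(a.numpy(), b.numpy()).shape)  # (3, 5)
```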
There is already a discussion about this on Discuss PyTorch: Measuring GPU tensor operation speed.
I'd like to highlight two comments from that thread:
[...] the GPU executes all operations asynchronously, so you need to insert proper barriers for your benchmarks to be correct
I believe cublas handles are allocated lazily now, which means that first operation requiring cublas will have an overhead of creating cublas handle, and that includes some internal allocations. So there’s no way to avoid it other than calling some function requiring cublas before the timing loop.
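The second comment describes a one-off cost: the first operation that needs cuBLAS also creates the handle, so benchmark code usually runs a warm-up call before the timed region. A minimal sketch of that pattern outside a notebook, using plain time.perf_counter instead of %time (the shapes are simply the ones from the question):

```python
import time
import torch

cuda = torch.device("cuda")
a = torch.randn(10000, 10000, device=cuda)
b = torch.randn(10000, 10000, device=cuda)

# Warm-up: the first cuBLAS-backed call creates the handle and does some
# internal allocations, so pay that cost before the timed region.
torch.matmul(a, b)
torch.cuda.synchronize()

start = time.perf_counter()
c = torch.matmul(a, b)
torch.cuda.synchronize()  # wait for the kernel to actually finish
print(f"matmul took {time.perf_counter() - start:.3f} s")
```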
Basically, you have to call torch.cuda.synchronize() to get a proper measurement:
import torch
x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()
%time y = x.mm(w.t()); torch.cuda.synchronize()
CPU times: user 288 ms, sys: 191 ms, total: 479 ms
Wall time: 492 ms
x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()
%time y = x.mm(w.t()); torch.cuda.synchronize()
CPU times: user 237 ms, sys: 231 ms, total: 468 ms
Wall time: 469 ms
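With the barrier in place the two runs now agree, and the roughly 470-490 ms wall time is the real cost of the matmul. If you want to time only the GPU work, CUDA events are a common alternative; a sketch under the same setup (this is not from the linked thread):

```python
import torch

x = torch.randn(10000, 10000, device="cuda")
w = torch.randn(10000, 10000, device="cuda")
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()              # marks a point in the CUDA stream
y = x.mm(w.t())
end.record()
torch.cuda.synchronize()    # wait until both events have been reached

print(f"GPU time: {start.elapsed_time(end):.1f} ms")  # elapsed_time returns milliseconds
```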