 

How can torch multiply two 10000×10000 matrices in almost zero time? Why does the time drop from 349 ms down to 999 µs?

Here is an excerpt from Jupyter:

In [1]:

import torch, numpy as np, datetime
cuda = torch.device('cuda')

In [2]:

ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))

Wall time: 349 ms

tensor(17.0374, device='cuda:0') tensor(17.0376, device='cuda:0')

The time is low but still reasonable (0.35 s for 1e12 multiplications).
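A rough sanity check of those numbers (added for illustration, not in the original post):

# a 10000x10000 @ 10000x10000 matmul needs 10000**3 = 1e12 multiply-adds
print(1e12 / 0.349)    # ~2.9e12 multiply-adds per second -> plausible for a GPU
print(1e12 / 999e-6)   # ~1.0e15 per second -> roughly 1000x higher, not plausible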

But if we repeat the same operation:

ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))

Wall time: 999 µs

tensor(-78.7172, device='cuda:0') tensor(-78.7173, device='cuda:0')

1e12 multiplications in 1ms?!

Why did the time change from 349 ms to 1 ms?

Info:

  • Tested on GeForce RTX 2070;
  • Can be reproduced on Google Colab.
asked Jul 07 '19 by user31264

People also ask

How do you multiply matrices with PyTorch?

For matrix multiplication in PyTorch, use torch.mm(). NumPy's np.dot(), in contrast, is more flexible: it computes the inner product for 1-D arrays and performs matrix multiplication for 2-D arrays.
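A quick illustration of the difference (not part of the original thread):

import torch
import numpy as np

a = torch.randn(3, 4)
b = torch.randn(4, 5)
c = torch.mm(a, b)                  # matrix product, shape (3, 5); both inputs must be 2-D

v = np.array([1.0, 2.0, 3.0])
print(np.dot(v, v))                 # 1-D inputs: inner product -> 14.0
print(np.dot(np.ones((3, 4)), np.ones((4, 5))).shape)  # 2-D inputs: matrix product -> (3, 5)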

How do you multiply torch tensors?

Multiply two or more tensors using torch.mul() and assign the result to a new variable. You can also multiply a tensor by a scalar. Multiplying tensors this way does not change the original tensors.
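For example (illustrative sketch, not from the original thread):

import torch

a = torch.tensor([1., 2., 3.])
b = torch.tensor([4., 5., 6.])
c = torch.mul(a, b)    # element-wise product -> tensor([ 4., 10., 18.])
d = torch.mul(a, 2.0)  # tensor times scalar  -> tensor([2., 4., 6.])
# a and b are left unchanged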

How does Matlab do matrix multiplication?

Matrix multiplication is possible only if the number of columns in A equals the number of rows in B. In matrix multiplication, the elements of the rows of the first matrix are multiplied with the corresponding columns of the second matrix.
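The same dimension rule applies to torch.matmul; a minimal sketch (not from the original thread):

import torch

a = torch.randn(2, 3)    # 2 rows, 3 columns
b = torch.randn(3, 4)    # 3 rows, 4 columns
c = torch.matmul(a, b)   # OK: columns of a (3) == rows of b (3); result is 2x4
# torch.matmul(b, a) would raise a RuntimeError (4 columns vs. 2 rows)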


1 Answer

There is already a discussion about this on Discuss PyTorch: Measuring GPU tensor operation speed.

I'd like to highlight two comments from that thread:

  • From @apaszke:

[...] the GPU executes all operations asynchronously, so you need to insert proper barriers for your benchmarks to be correct

  • From @ngimel:

I believe cublas handles are allocated lazily now, which means that first operation requiring cublas will have an overhead of creating cublas handle, and that includes some internal allocations. So there’s no way to avoid it other than calling some function requiring cublas before the timing loop.


Basically, you have to call torch.cuda.synchronize() to get a proper measurement:

import torch

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()

%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 288 ms, sys: 191 ms, total: 479 ms

Wall time: 492 ms

x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()

%time y = x.mm(w.t()); torch.cuda.synchronize()

CPU times: user 237 ms, sys: 231 ms, total: 468 ms

Wall time: 469 ms
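If you want to time just the kernel itself, an alternative sketch (not from the original answer) uses CUDA events, which record timestamps directly on the GPU stream:

import torch

x = torch.randn(10000, 10000, device="cuda")
w = torch.randn(10000, 10000, device="cuda")
torch.cuda.synchronize()  # make sure setup (including lazy cuBLAS handle creation) is done

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
y = x.mm(w.t())
end.record()

torch.cuda.synchronize()              # wait for the kernel to finish
print(start.elapsed_time(end), "ms")  # elapsed GPU time in milliseconds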

answered Sep 21 '22 by Berriel