Here is an excerpt from a Jupyter notebook:
In [1]:
import torch, numpy as np, datetime
cuda = torch.device('cuda')
In [2]:
ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))
Wall time: 349 ms
tensor(17.0374, device='cuda:0') tensor(17.0376, device='cuda:0')
The time is low but still seems plausible (0.35 s for about 1e12 multiplications).
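As a back-of-the-envelope check of that claim (the GPU model is not given in the question, so this is only rough arithmetic):

```python
n = 10_000
mults = n ** 3                 # one multiply per output element per inner index: 1e12 in total
seconds = 0.349
print(f"{mults / seconds / 1e12:.1f}e12 multiplications per second")  # ~2.9e12
```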
But if we repeat the same:
ac = torch.randn(10000, 10000).to(cuda)
bc = torch.randn(10000, 10000).to(cuda)
%time cc = torch.matmul(ac, bc)
print(cc[0, 0], torch.sum(ac[0, :] * bc[:, 0]))
Wall time: 999 µs
tensor(-78.7172, device='cuda:0') tensor(-78.7173, device='cuda:0')
1e12 multiplications in 1 ms?!
Why did the time drop from 349 ms to 1 ms?
Info:
For matrix multiplication in PyTorch, use torch.mm(). NumPy's np.dot() is more flexible by comparison: it computes the inner product for 1-D arrays and performs matrix multiplication for 2-D arrays.
To multiply two tensors element-wise, use torch.mul() and assign the result to a new variable. You can also multiply a tensor by a scalar. Multiplying tensors this way does not modify the original tensors.
Matrix multiplication is only possible if the number of columns in A equals the number of rows in B; the elements of each row of the first matrix are multiplied with the corresponding column of the second matrix.
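That terminology is compressed, so here is a small self-contained illustration of the difference (CPU tensors with arbitrary example shapes; none of this comes from the question itself):

```python
import torch
import numpy as np

a = torch.randn(3, 4)
b = torch.randn(4, 5)

# torch.mm / torch.matmul: true matrix multiplication, (3, 4) @ (4, 5) -> (3, 5)
print(torch.mm(a, b).shape)        # torch.Size([3, 5])
print(torch.matmul(a, b).shape)    # same result for 2-D inputs

# torch.mul: element-wise product (shapes must broadcast); it leaves its
# inputs untouched and also accepts a scalar
print(torch.mul(a, 2.0).shape)     # torch.Size([3, 4])

# np.dot: inner product for 1-D arrays, matrix multiplication for 2-D arrays
x = np.random.randn(4)
y = np.random.randn(4)
print(np.dot(x, y))                        # scalar inner product
print(np.dot(a.numpy(), b.numpy()).shape)  # (3, 5)
```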
There is already a discussion about this on Discuss PyTorch: Measuring GPU tensor operation speed.
I'd like to highlight two comments from that thread:
[...] the GPU executes all operations asynchronously, so you need to insert proper barriers for your benchmarks to be correct
I believe cublas handles are allocated lazily now, which means that first operation requiring cublas will have an overhead of creating cublas handle, and that includes some internal allocations. So there’s no way to avoid it other than calling some function requiring cublas before the timing loop.
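The second comment describes a one-off cost: the first operation that needs cuBLAS also creates the handle, so benchmark code usually runs a warm-up call before the timed region. A minimal sketch of that pattern outside a notebook, using plain time.perf_counter instead of %time (the shapes are simply the ones from the question):

```python
import time
import torch

cuda = torch.device("cuda")
a = torch.randn(10000, 10000, device=cuda)
b = torch.randn(10000, 10000, device=cuda)

# Warm-up: the first cuBLAS-backed call creates the handle and does some
# internal allocations, so pay that cost before the timed region.
torch.matmul(a, b)
torch.cuda.synchronize()

start = time.perf_counter()
c = torch.matmul(a, b)
torch.cuda.synchronize()  # wait for the kernel to actually finish
print(f"matmul took {time.perf_counter() - start:.3f} s")
```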
Basically, you have to call torch.cuda.synchronize() to get a proper measurement:
import torch
x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()
%time y = x.mm(w.t()); torch.cuda.synchronize()
CPU times: user 288 ms, sys: 191 ms, total: 479 ms
Wall time: 492 ms
x = torch.randn(10000, 10000).to("cuda")
w = torch.randn(10000, 10000).to("cuda")
# ensure that context initialization finishes before you start measuring time
torch.cuda.synchronize()
%time y = x.mm(w.t()); torch.cuda.synchronize()
CPU times: user 237 ms, sys: 231 ms, total: 468 ms
Wall time: 469 ms
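With the barrier in place the two runs now agree, and the roughly 470-490 ms wall time is the real cost of the matmul. If you want to time only the GPU work, CUDA events are a common alternative; a sketch under the same setup (this is not from the linked thread):

```python
import torch

x = torch.randn(10000, 10000, device="cuda")
w = torch.randn(10000, 10000, device="cuda")
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()              # marks a point in the CUDA stream
y = x.mm(w.t())
end.record()
torch.cuda.synchronize()    # wait until both events have been reached

print(f"GPU time: {start.elapsed_time(end):.1f} ms")  # elapsed_time returns milliseconds
```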