It has been firmly established that my_tensor.detach().numpy()
is the correct way to get a numpy array from a torch
tensor.
I'm trying to get a better understanding of why.
In the accepted answer to the question just linked, Blupon states that:
You need to convert your tensor to another tensor that isn't requiring a gradient in addition to its actual value definition.
In the first discussion he links to, albanD states:
This is expected behavior because moving to numpy will break the graph and so no gradient will be computed.
If you don’t actually need gradients, then you can explicitly .detach() the Tensor that requires grad to get a tensor with the same content that does not require grad. This other Tensor can then be converted to a numpy array.
In the second discussion he links to, apaszke writes:
Variable's can’t be transformed to numpy, because they’re wrappers around tensors that save the operation history, and numpy doesn’t have such objects. You can retrieve a tensor held by the Variable, using the .data attribute. Then, this should work: var.data.numpy().
I have studied the internal workings of PyTorch's autodifferentiation library, and I'm still confused by these answers. Why does moving to numpy break the graph? Is it because any operations on the numpy array will not be tracked in the autodiff graph?
What is a Variable? How does it relate to a tensor?
I feel that a thorough, high-quality Stack Overflow answer, explaining the reason for this to new PyTorch users who don't yet understand autodifferentiation, is called for here.
In particular, I think it would be helpful to illustrate the graph through a figure and show how the disconnection occurs in this example:
import torch

tensor1 = torch.tensor([1.0, 2.0], requires_grad=True)
print(tensor1)
print(type(tensor1))

# this conversion fails with a RuntimeError because tensor1 requires grad:
tensor1 = tensor1.numpy()
print(tensor1)
print(type(tensor1))
detach() is used to detach a tensor from the current computational graph. It returns a new tensor that doesn't require a gradient. When we don't need a tensor to be traced for the gradient computation, we detach it from the current computational graph.
From the documentation of detach():
Returns a new Tensor, detached from the current graph. The result will never require gradient. This method also affects forward mode AD gradients and the result will never have forward mode AD gradients. Returned Tensor shares the same storage with the original one.
Note that detach() does not create a copy: the returned tensor shares its data with the original one, but gradients are blocked from flowing through it. detach() is useful when you need a tensor's values but do not need them in the computational graph.
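A minimal sketch of this behavior (the tensor values are only illustrative):

import torch

t = torch.tensor([1.0, 2.0], requires_grad=True)
d = t.detach()                       # new tensor: same storage, no graph

print(d.requires_grad)               # False: operations on d are not tracked
print(d.data_ptr() == t.data_ptr())  # True: no copy was made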
I think the most crucial point to understand here is the difference between a torch.tensor and an np.ndarray:
While both objects are used to store n-dimensional matrices (aka "Tensors"), torch.tensors have an additional "layer", which stores the computational graph leading to the associated n-dimensional matrix.
So, if you are only interested in an efficient and easy way to perform mathematical operations on matrices, np.ndarray and torch.tensor can be used interchangeably.
However, torch.tensors are designed to be used in the context of gradient descent optimization, and therefore they hold not only a tensor with numeric values, but (and more importantly) the computational graph leading to these values. This computational graph is then used (via the chain rule of derivatives) to compute the derivative of the loss function w.r.t each of the independent variables used to compute the loss.
As mentioned before, an np.ndarray object does not have this extra "computational graph" layer, and therefore, when converting a torch.tensor to an np.ndarray you must explicitly remove the computational graph of the tensor using the detach() command.
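To make this concrete, here is a minimal sketch of the failure and the fix (the quoted error message is what recent PyTorch versions raise):

import torch

t = torch.ones(3, requires_grad=True)
# t.numpy() raises:
# RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
a = t.detach().numpy()  # drop the graph first, then convert
print(type(a))          # <class 'numpy.ndarray'>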
Computational Graph
From your comments it seems like this concept is a bit vague. I'll try to illustrate it with a simple example.
Consider a simple function of two (vector) variables, x and w:
import torch

x = torch.rand(4, requires_grad=True)
w = torch.rand(4, requires_grad=True)

y = x @ w   # inner-product of x and w
z = y ** 2  # square the inner product
If we are only interested in the value of z, we need not worry about any graphs; we simply move forward from the inputs, x and w, to compute y and then z.
However, what would happen if we do not care so much about the value of z, but rather want to ask the question "what is the w that minimizes z for a given x"?
To answer that question, we need to compute the derivative of z w.r.t w.
How can we do that?
Using the chain rule we know that dz/dw = dz/dy * dy/dw. That is, to compute the gradient of z w.r.t w we need to move backward from z back to w, computing the gradient of the operation at each step as we trace back our steps from z to w. This "path" we trace back is the computational graph of z, and it tells us how to compute the derivative of z w.r.t the inputs leading to z:
z.backward() # ask pytorch to trace back the computation of z
We can now inspect the gradient of z w.r.t w:
w.grad  # the resulting gradient of z w.r.t w
tensor([0.8010, 1.9746, 1.5904, 1.0408])
Note that this exactly equals
2*y*x
tensor([0.8010, 1.9746, 1.5904, 1.0408], grad_fn=<MulBackward0>)
since dz/dy = 2*y and dy/dw = x.
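If you want a numeric sanity check, continuing the snippet above (the exact values depend on the random initialization):

torch.allclose(w.grad, 2 * y * x)  # True: autograd matches the hand-derived chain rule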
Each tensor along the path stores its "contribution" to the computation:
z
tensor(1.4061, grad_fn=<PowBackward0>)
And
y
tensor(1.1858, grad_fn=<DotBackward>)
As you can see, y and z store not only the "forward" values of <x, w> and y**2, but also the computational graph, i.e. the grad_fn that is needed to compute the derivatives (using the chain rule) when tracing back the gradients from z (output) to w (inputs).
These grad_fn are essential components of torch.tensors; without them one cannot compute derivatives of complicated functions. However, np.ndarrays do not have this capability at all, and they do not hold this information.
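You can inspect this backward graph yourself: each grad_fn node links to the nodes that produced its inputs (the exact repr strings vary across PyTorch versions):

z.grad_fn                 # <PowBackward0 object at ...>
z.grad_fn.next_functions  # ((<DotBackward object at ...>, 0),)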
Please see this answer for more information on tracing back the derivative using the backward() function.
Since both np.ndarray and torch.tensor share a common "layer" storing an n-d array of numbers, pytorch uses the same storage to save memory:
numpy() → numpy.ndarray
Returns self tensor as a NumPy ndarray. This tensor and the returned ndarray share the same underlying storage. Changes to self tensor will be reflected in the ndarray and vice versa.
The other direction works in the same way as well:
torch.from_numpy(ndarray) → Tensor
Creates a Tensor from a numpy.ndarray.
The returned tensor and ndarray share the same memory. Modifications to the tensor will be reflected in the ndarray and vice versa.
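A minimal sketch of this memory sharing (the specific values are only for illustration):

import numpy as np
import torch

t = torch.zeros(3)
a = t.numpy()       # no copy: a and t view the same memory
a[0] = 42.0
print(t)            # tensor([42.,  0.,  0.])

b = torch.from_numpy(np.ones(3, dtype=np.float32))
b[1] = -1.0
print(b)            # changes made through the tensor are visible in the ndarray too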
Thus, when creating an np.array from a torch.tensor (or vice versa), both objects reference the same underlying storage in memory. Since an np.ndarray does not store/represent the computational graph associated with the array, this graph should be explicitly removed using detach() when numpy and torch wish to reference the same storage.
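Note that detach() keeps this sharing in place: a numpy array created from a detached tensor still points at the original storage, so in-place changes propagate; they are simply invisible to autograd. A short sketch:

t = torch.ones(3, requires_grad=True)
a = t.detach().numpy()  # same storage as t, but no graph attached
a[0] = 5.0
print(t)                # tensor([5., 1., 1.], requires_grad=True)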
Note that if you wish, for some reason, to use pytorch only for mathematical operations without back-propagation, you can use the torch.no_grad() context manager, in which case computational graphs are not created, and torch.tensors and np.ndarrays can be used interchangeably.
import numpy as np
import torch

with torch.no_grad():
    x_t = torch.rand(3, 4)
    y_np = np.ones((4, 2), dtype=np.float32)
    x_t @ torch.from_numpy(y_np)  # dot product in torch
    np.dot(x_t.numpy(), y_np)     # the same dot product in numpy