I am a bit confused about the use of the function tf.matmul() in TensorFlow, though my question may be more about the theory of deep learning. Say you have an input X and a weight matrix W (assuming zero bias); I want to compute WX as an output, which could be done with tf.matmul(W, X). However, in the tutorial MNIST for beginners the order is reversed and tf.matmul(X, W) is used instead. On the other hand, in the next tutorial, TensorFlow Mechanics 101, tf.matmul(W, X) is used. Since the matrix sizes matter for multiplication, I wonder if someone can clarify this issue.
I think you must be misreading the Mechanics 101 tutorial; could you point to the specific line?
In general, for a network layer, I think of the inputs "flowing through" the weights. To represent that, I write tf.matmul(Inputs, Weights) to produce the output of that layer. That output may then have a bias b added to it, the result fed into a nonlinear function such as a relu, and then into another tf.matmul as the input for the next layer.
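A minimal numpy sketch of that flow (np.dot standing in for tf.matmul, np.maximum(0, .) for the relu; all shapes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.standard_normal((3, 5))    # batch of 3 examples, 5 features each
W1 = rng.standard_normal((5, 4))   # weights: 5 inputs -> 4 hidden units
b1 = np.zeros(4)                   # bias for the hidden layer
W2 = rng.standard_normal((4, 2))   # weights: 4 hidden units -> 2 outputs
b2 = np.zeros(2)

hidden = np.maximum(0, np.dot(X, W1) + b1)  # matmul, add bias, relu
output = np.dot(hidden, W2) + b2            # hidden feeds the next matmul
print(output.shape)  # (3, 2)
```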
Second, remember that the Weights matrix may be sized to produce multiple outputs. That's why it's a matrix, not just a vector. For example, if you wanted two hidden units and you had five input features, you would use a shape [5, 2]
weight matrix, like this (shown in numpy for ease of exposition - you can do the same thing in tensorflow):
>>> import numpy as np
>>> a = np.array([1, 2, 3, 4, 5])
>>> W = np.array([[.5, .6], [.7, .8], [.9, .1], [.2, .3], [.4, .5]])
>>> np.dot(a, W)
array([ 7.4,  6.2])
This has the nice behavior that if you then add a batch dimension to a
, it still works:
>>> a = np.array([[1, 2, 3, 4, 5],
...               [6, 7, 8, 9, 10]])
>>> np.dot(a, W)
array([[  7.4,   6.2],
       [ 20.9,  17.7]])
This is exactly what you're doing when you use tf.matmul to go from input features to hidden units, or from one layer of hidden units to another.
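As for the two orderings in the question: multiplying with examples as rows (X * W) and with examples as columns (W^T * X) compute the same numbers, just transposed. A quick numpy check (same W as above, and a made-up two-example batch):

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5],
              [6, 7, 8, 9, 10]])                    # examples as rows
W = np.array([[.5, .6], [.7, .8], [.9, .1], [.2, .3], [.4, .5]])

rows_first = np.dot(a, W)      # X * W   (examples as rows)
cols_first = np.dot(W.T, a.T)  # W^T * X (examples as columns)
print(np.allclose(rows_first, cols_first.T))  # True
```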
I don't know much about TensorFlow, but intuitively I feel that the confusion is about the data representation of the input. When you say you want to multiply an input X by a weight matrix W, I think what you mean is that you want to multiply each dimension (feature) by its corresponding weight and take the sum. So if you have an input x with, say, m dimensions, you should have a weight vector w with m values (m + 1 if you count the bias).
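A minimal sketch of that multiply-and-sum in numpy (the numbers are made up):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])    # one input with m = 3 features
w = np.array([0.5, -0.2, 0.1])   # one weight per feature

# "multiply each feature by its weight and take the sum"
manual = sum(xi * wi for xi, wi in zip(x, w))
print(np.isclose(manual, np.dot(x, w)))  # True: that sum is the dot product
```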
Now if you choose to represent the different training instances as rows of a matrix X, you would perform X * w; if instead you choose to represent them as columns, you would do w^T * X.
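A small numpy illustration of the two layouts (hypothetical numbers; either way, each entry of the result is one training instance's weighted sum):

```python
import numpy as np

X = np.array([[1., 2., 3.],
              [4., 5., 6.]])      # two training instances as rows, m = 3
w = np.array([0.5, -0.2, 0.1])    # weight vector with m values

as_rows = np.dot(X, w)   # instances as rows:    X * w
as_cols = np.dot(w, X.T) # instances as columns: w^T * (here, X transposed)
print(np.allclose(as_rows, as_cols))  # True
```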