
What is the purpose of weights and biases in tensorflow word2vec example?

I'm trying to understand how the word2vec example works, and I don't really understand the purpose of the weights and biases passed into the nce_loss function. There are two variable inputs to the function: the weights (plus biases) and the embeddings.

# Look up embeddings for inputs.
embeddings = tf.Variable(
    tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
embed = tf.nn.embedding_lookup(embeddings, train_inputs)

# Construct the variables for the NCE loss
nce_weights = tf.Variable(
    tf.truncated_normal([vocabulary_size, embedding_size],
                        stddev=1.0 / math.sqrt(embedding_size)))
nce_biases = tf.Variable(tf.zeros([vocabulary_size]))

Both are randomly initialized and (as far as I understand) both are subject to updates during learning.

# Compute the average NCE loss for the batch.
loss = tf.reduce_mean(
  tf.nn.nce_loss(nce_weights, nce_biases, embed, train_labels,
                 num_sampled, vocabulary_size))

I suppose both of them should represent the trained model. However, the weights and biases are never used later on for similarity calculations. Instead, only one component is used:

# Compute the cosine similarity between minibatch examples and all embeddings.
norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keep_dims=True))
normalized_embeddings = embeddings / norm
valid_embeddings = tf.nn.embedding_lookup(
  normalized_embeddings, valid_dataset)
similarity = tf.matmul(
  valid_embeddings, normalized_embeddings, transpose_b=True)

So what about the second component of the model? Why are the weights and biases being ignored?

Thank you.

Asked Jun 23 '16 by WarGoth


2 Answers

In word2vec, what you want is a vector representation of words. One way to get it is with a neural network, so you have input neurons, output neurons, and a hidden layer. To learn the vector representation, you use a hidden layer whose number of neurons equals the dimension you want for your vectors. There is one input per word and one output per word. You then train the network to predict the output from the input, but in the middle you have a smaller layer, which you can see as an encoding of the input as a vector. The weights and biases live in this network. But you don't need them later: what you use at test time is a dictionary that maps each word to the vector representing it. This is faster than running the neural network to get the representation, and that is why you don't see the weights and biases afterwards.
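As a minimal sketch of this test-time dictionary idea (the vocabulary and matrix here are made up; in the real example the rows would come from the trained `embeddings` variable):

```python
import numpy as np

# Hypothetical tiny vocabulary and an "already trained" embedding matrix.
vocab = ["king", "queen", "man", "woman"]
embedding_size = 3
rng = np.random.default_rng(0)
embeddings = rng.uniform(-1.0, 1.0, size=(len(vocab), embedding_size))

# The dictionary used at test time: word -> vector, no network needed.
word_to_vec = {word: embeddings[i] for i, word in enumerate(vocab)}

# Getting a word's representation is just a row lookup, not a forward pass.
print(word_to_vec["king"])
```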

The last piece of code you show, with the cosine distance, is for finding which vectors are close to your calculated vector. You take some words (vectors), perform some operations on them (like: king - man + woman), and you get a vector that you want to map back to a word. The cosine function is run against all the vectors (queen would have the minimum distance to the result vector of the operation).
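The analogy lookup can be sketched like this; the toy vectors are chosen by hand so the analogy works, whereas real word2vec vectors would come from the trained embedding matrix:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Hand-picked illustrative vectors, not real trained embeddings.
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.2, 0.8, 0.9]),
}

result = vecs["king"] - vecs["man"] + vecs["woman"]

# The word whose vector is closest (highest cosine similarity) wins.
best = max(vecs, key=lambda w: cosine_similarity(vecs[w], result))
print(best)  # queen
```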

To sum up, you don't see the weights and biases in the validation phase because you don't need them. You use the dictionary you created during training.

UPDATE: s0urcer has explained better how the vector representation is created.

The input layer and the output layer of the network represent words: a value is 1 if the word is there and 0 if it is not. The first position is one word, the second another one, and so on. You have as many input/output neurons as words.

The middle layer is the context, or your vector representation of the words.

Now you train the network with sentences or groups of consecutive words. From each group you take one word and set it on the inputs, and the other words are the outputs of the network. So basically the network learns how a word is related to the other words in its context.
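The training pairs described above could be generated with a sketch like this (the sentence and window size are illustrative):

```python
# Build (input word, output word) pairs for skip-gram style training:
# each word is paired with its neighbors inside a small context window.
sentence = ["the", "quick", "brown", "fox", "jumps"]
window = 1

pairs = []
for i, center in enumerate(sentence):
    for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
        if j != i:
            pairs.append((center, sentence[j]))

# For each pair, the network gets the center word on the input side
# and is trained to predict the context word on the output side.
print(pairs[:4])
```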

To get the vector representation of each word, you set the input neuron of that word to 1 and read the values of the context layer (the middle layer). Those values are the values of the vector. Since all the inputs are 0 except the one word that is 1, those values are the weights of the connections from that input neuron to the context layer.

You don't use the network later because you don't need to compute all the values of the context layer; that would be slower. You only need to look up those values for the word in your dictionary.
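This equivalence — a one-hot input's forward pass to the middle layer is exactly one row of the input weight matrix — can be checked directly (the sizes here are arbitrary):

```python
import numpy as np

vocabulary_size, embedding_size = 5, 3
rng = np.random.default_rng(42)
W_in = rng.normal(size=(vocabulary_size, embedding_size))  # input -> hidden weights

word_index = 2
one_hot = np.zeros(vocabulary_size)
one_hot[word_index] = 1.0

hidden = one_hot @ W_in      # forward pass to the middle (context) layer
lookup = W_in[word_index]    # direct row lookup, as the dictionary does

print(np.allclose(hidden, lookup))  # True
```

This is also why `tf.nn.embedding_lookup` exists: it does the row lookup without ever materializing the one-hot vector or the matrix multiply.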

Answered Nov 08 '22 by jorgemf


The idea of skip-gram is comparing words by their contexts: we consider words similar if they appear in similar contexts. The first layer of the NN holds the words' vector encodings (basically what is called embeddings). The second layer represents the context. Each time, we take just one row (Ri) of the first layer (because the input vector always looks like 0, ..., 0, 1, 0, ..., 0) and multiply it by all the columns of the second layer (Cj, j = 1..number of words); that product is the output of the NN. We train the network so that the output components Ri * Cj are largest when words i and j often appear nearby (in the same context). During each training step we tune only one Ri (again, because of the way input vectors are chosen) and all Cj, j = 1..w. When training ends, we throw away the matrix of the second layer because it represents the context. We keep only the matrix of the first layer, which holds the vector encodings of the words.
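The two-matrix picture above can be sketched in a few lines (the sizes and random values are illustrative): with a one-hot input selecting row i, the network's raw outputs are just the dot products of Ri with each column Cj.

```python
import numpy as np

num_words, dim = 4, 3
rng = np.random.default_rng(1)
R = rng.normal(size=(num_words, dim))   # first layer: word encodings (kept)
C = rng.normal(size=(dim, num_words))   # second layer: contexts (discarded after training)

i = 0
outputs = R[i] @ C   # one score Ri * Cj per vocabulary word j

print(outputs.shape)  # (4,)
```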

Answered Nov 08 '22 by s0urcer