The Tensorflow tutorial here refers to their basic implementation, which you can find on GitHub here, where the Tensorflow authors implement word2vec vector-embedding training/evaluation with the Skip-gram model.

My question is about the actual generation of (target, context) pairs in the generate_batch() function.

On this line the Tensorflow authors randomly sample nearby target indices around the "center" word index in the sliding window of words. However, they also keep a data structure targets_to_avoid, to which they first add the "center" context word (which of course we don't want to sample) but ALSO every other context word once it has been sampled.
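For reference, the sampling logic in question looks roughly like this (a paraphrase of the tutorial's inner loop, so exact variable names and details may differ between versions of word2vec_basic.py):

for i in range(batch_size // num_skips):
    target = skip_window            # start at the center word of the buffer
    targets_to_avoid = [skip_window]  # never sample the center word itself
    for j in range(num_skips):
        while target in targets_to_avoid:
            target = random.randint(0, span - 1)  # span = 2 * skip_window + 1
        targets_to_avoid.append(target)  # ...and never sample the same context word twice
        batch[i * num_skips + j] = buffer[skip_window]
        labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)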
My questions are as follows:

1. Why sample randomly from the sliding window instead of just looping over it and using every (target, context) pair? It seems strange to add this in word2vec_basic.py (their "basic" implementation).

2. Why keep targets_to_avoid at all? If they wanted truly random, they'd use selection with replacement, and if they wanted to ensure they got all the options, they should have just used a loop and gotten them all in the first place!

Thanks!
I tried out your proposed way to generate batches - having a loop and using the whole skip-window. The results are:
1. Faster generation of batches

For a batch size of 128 and a skip window of 5, num_skips=2 takes 3.59 s per 10,000 batches.

2. Higher danger of overfitting

Keeping the rest of the tutorial code as it is, I trained the model both ways and logged the average loss every 2000 steps. The same pattern occurred repeatedly: using 10 samples per word instead of 2 can cause overfitting.
Here is the code that I used for generating the batches. It replaces the tutorial's generate_batch function.
import numpy as np

data_index = 0

def generate_batch(batch_size, skip_window):
    global data_index
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)     # Row
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  # Column
    # For each word in the data, add the context to the batch and the word to the labels
    batch_index = 0
    while batch_index < batch_size:
        context = data[get_context_indices(data_index, skip_window)]
        # Add the context to the remaining batch space
        remaining_space = min(batch_size - batch_index, len(context))
        batch[batch_index:batch_index + remaining_space] = context[0:remaining_space]
        labels[batch_index:batch_index + remaining_space] = data[data_index]
        # Update the data_index and the batch_index
        batch_index += remaining_space
        data_index = (data_index + 1) % len(data)
    return batch, labels
Edit: get_context_indices is a simple function that returns the slice of indices within the skip_window around data_index. See the slice() documentation for more info.
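The answer doesn't show that helper, but a minimal sketch consistent with the description above (assuming it only clamps the window at the start of the data) might look like:

def get_context_indices(index, skip_window):
    # Slice covering the words within skip_window positions of `index`.
    # In this sketch the slice also includes the center word itself,
    # since a single slice cannot skip the middle element.
    return slice(max(0, index - skip_window), index + skip_window + 1)

With that in place, batches would be drawn the same way as in the tutorial, e.g. batch, labels = generate_batch(batch_size=128, skip_window=5).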
There is a parameter named num_skips which denotes the number of (input, output) pairs generated from a single window: [skip_window target skip_window]. So num_skips restricts the number of context words we use as output words, and that is why generate_batch asserts num_skips <= 2 * skip_window. The code just randomly picks num_skips context words to construct training pairs with the target.
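As a rough illustration (this is not the tutorial's code, just a toy sketch of the idea), sampling num_skips context words without replacement from one window could look like:

import random

skip_window = 2
num_skips = 2
window = ["the", "quick", "fox", "jumps", "over"]  # center word is "fox"
target_index = skip_window

assert num_skips <= 2 * skip_window
candidates = [i for i in range(len(window)) if i != target_index]
chosen = random.sample(candidates, num_skips)  # sampling without replacement
pairs = [(window[target_index], window[i]) for i in chosen]
print(pairs)  # e.g. [('fox', 'quick'), ('fox', 'over')]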
But I don't know how num_skips affects the performance.