The Tensorflow tutorial here refers to their basic implementation, which you can find on GitHub here, where the Tensorflow authors implement word2vec vector-embedding training/evaluation with the Skip-gram model.

My question is about the actual generation of (target, context) pairs in the generate_batch() function.

On this line the Tensorflow authors randomly sample nearby target indices around the "center" word index in the sliding window of words. However, they also keep a data structure targets_to_avoid, to which they first add the "center" context word (which of course we don't want to sample) but ALSO every other context word once it has been sampled.
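For reference, the sampling logic in question looks roughly like this (a paraphrase of the tutorial's inner loop, so exact variable names and details may differ between versions of word2vec_basic.py):

for i in range(batch_size // num_skips):
    target = skip_window            # start at the center word of the buffer
    targets_to_avoid = [skip_window]  # never sample the center word itself
    for j in range(num_skips):
        while target in targets_to_avoid:
            target = random.randint(0, span - 1)  # span = 2 * skip_window + 1
        targets_to_avoid.append(target)  # ...and never sample the same context word twice
        batch[i * num_skips + j] = buffer[skip_window]
        labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)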
My questions are as follows:

1. Why sample randomly from the sliding window instead of just looping over it and using every (target, context) pair? It seems strange to add this in word2vec_basic.py (their "basic" implementation).

2. Why keep targets_to_avoid at all? If they wanted truly random, they'd use selection with replacement, and if they wanted to ensure they got all the options, they should have just used a loop and gotten them all in the first place!

Thanks!
I tried out your proposed way to generate batches - having a loop and using the whole skip-window. The results are:
1. Faster generation of batches

For a batch size of 128 and a skip window of 5, num_skips=2 takes 3.59 s per 10,000 batches.

2. Higher danger of overfitting

Keeping the rest of the tutorial code as it is, I trained the model both ways and logged the average loss every 2000 steps. The same pattern occurred repeatedly: using 10 samples per word instead of 2 can cause overfitting.
Here is the code that I used for generating the batches. It replaces the tutorial's generate_batch function.
import numpy as np

data_index = 0

def generate_batch(batch_size, skip_window):
    global data_index
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)     # Row
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)  # Column
    # For each word in the data, add the context to the batch and the word to the labels
    batch_index = 0
    while batch_index < batch_size:
        context = data[get_context_indices(data_index, skip_window)]
        # Add the context to the remaining batch space
        remaining_space = min(batch_size - batch_index, len(context))
        batch[batch_index:batch_index + remaining_space] = context[0:remaining_space]
        labels[batch_index:batch_index + remaining_space] = data[data_index]
        # Update the data_index and the batch_index
        batch_index += remaining_space
        data_index = (data_index + 1) % len(data)
    return batch, labels
Edit: get_context_indices is a simple function that returns the slice of indices within the skip_window around data_index. See the slice() documentation for more info.
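The answer doesn't show that helper, but a minimal sketch consistent with the description above (assuming it only clamps the window at the start of the data) might look like:

def get_context_indices(index, skip_window):
    # Slice covering the words within skip_window positions of `index`.
    # In this sketch the slice also includes the center word itself,
    # since a single slice cannot skip the middle element.
    return slice(max(0, index - skip_window), index + skip_window + 1)

With that in place, batches would be drawn the same way as in the tutorial, e.g. batch, labels = generate_batch(batch_size=128, skip_window=5).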
There is a parameter named num_skips which denotes the number of (input, output) pairs generated from a single window: [skip_window target skip_window]. So num_skips restricts the number of context words we use as output words, and that is why generate_batch asserts num_skips <= 2 * skip_window. The code just randomly picks num_skips context words to construct training pairs with the target.
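As a rough illustration (this is not the tutorial's code, just a toy sketch of the idea), sampling num_skips context words without replacement from one window could look like:

import random

skip_window = 2
num_skips = 2
window = ["the", "quick", "fox", "jumps", "over"]  # center word is "fox"
target_index = skip_window

assert num_skips <= 2 * skip_window
candidates = [i for i in range(len(window)) if i != target_index]
chosen = random.sample(candidates, num_skips)  # sampling without replacement
pairs = [(window[target_index], window[i]) for i in chosen]
print(pairs)  # e.g. [('fox', 'quick'), ('fox', 'over')]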
But I don't know how num_skips affects the performance.