 

word2vec: negative sampling (in layman's terms)?

I'm reading the paper below and I have some trouble understanding the concept of negative sampling.

http://arxiv.org/pdf/1402.3722v1.pdf

Can anyone help, please?

Asked Jan 09 '15 by Andy K

People also ask

What is negative sampling in word2vec?

When the training data becomes very large, word2vec models become expensive to train. To address this issue, an approach called negative sampling is used with word2vec models, which allows only a small percentage of the network weights to be modified during training.

What does negative sampling mean?

Negative sampling allows us to modify only a small percentage of the weights, rather than all of them, for each training sample. We do this by slightly modifying our problem.

What is negative sampling in NLP?

In a nutshell, by defining a new objective function, negative sampling aims at maximizing the similarity of the words in the same context and minimizing it when they occur in different contexts.

What is negative sampling deep learning?

Negative sampling is a technique used to train machine learning models that generally have several orders of magnitude more negative observations than positive ones. In most cases these negative observations are not given to us explicitly and instead must be generated somehow.


2 Answers

The idea of word2vec is to maximise the similarity (dot product) between the vectors for words which appear close together (in the context of each other) in text, and minimise the similarity of words that do not. In equation (3) of the paper you link to, ignore the exponentiation for a moment. You have

         v_c . v_w
    -------------------
    sum_i(v_ci . v_w)

The numerator is basically the similarity between the context word c and the target word w. The denominator computes the similarity of all other contexts ci and the target word w. Maximising this ratio ensures that words which appear closer together in text have more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts ci. Negative sampling is one of the ways of addressing this problem: just select a couple of contexts ci at random. The end result is that if cat appears in the context of food, then the vector of food is more similar to the vector of cat (as measured by their dot product) than the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), instead of all other words in the language. This makes word2vec much, much faster to train.
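To make this concrete, here is a minimal numpy sketch of one skip-gram negative-sampling update. It is not the original word2vec implementation: the toy vocabulary, embedding size and learning rate are made up for illustration, and negatives are drawn uniformly rather than from the smoothed unigram distribution that word2vec actually uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy vocabulary; the real model uses the whole corpus vocabulary.
vocab = ["cat", "food", "democracy", "greed", "freddy"]
word2id = {w: i for i, w in enumerate(vocab)}

dim = 8                                              # illustrative embedding size
W_target = rng.normal(0.0, 0.1, (len(vocab), dim))   # target ("input") word vectors
W_context = rng.normal(0.0, 0.1, (len(vocab), dim))  # context ("output") word vectors


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_step(target, context, num_negative=3, lr=0.05):
    """One skip-gram negative-sampling update for a (target, context) pair:
    push the true pair's dot product up, and the dot products with a few
    randomly drawn "negative" words down."""
    t, c = word2id[target], word2id[context]
    # Negatives drawn uniformly here; word2vec samples from a unigram^(3/4)
    # distribution, and a negative may occasionally collide with the true context.
    negatives = rng.choice(len(vocab), size=num_negative)

    v_t = W_target[t]
    grad_t = np.zeros(dim)
    for idx, label in [(c, 1.0)] + [(n, 0.0) for n in negatives]:
        v_c = W_context[idx]
        g = sigmoid(v_t @ v_c) - label   # gradient of the binary logistic loss w.r.t. the score
        grad_t += g * v_c
        W_context[idx] -= lr * g * v_t
    W_target[t] -= lr * grad_t


# "cat" keeps appearing in the context of "food":
for _ in range(200):
    sgns_step("cat", "food")

print("cat . food      :", W_target[word2id["cat"]] @ W_context[word2id["food"]])
print("cat . democracy :", W_target[word2id["cat"]] @ W_context[word2id["democracy"]])
```

After a few hundred updates on the ("cat", "food") pair, cat·food should end up noticeably larger than cat·democracy, which is exactly the ranking described above, and each update only ever touches the true context plus a handful of sampled negatives.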

Answered by mbatchkarov


Computing the softmax (the function that determines which words are similar to the current target word) is expensive, since it requires summing over all words in V (the denominator), and V is generally very large.

[image: the softmax equation, normalised over all words in the vocabulary V]
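To see where the cost comes from, here is a small, purely illustrative numpy snippet; the vocabulary size and dimensionality are invented, but the point is that the denominator needs a dot product with every one of the |V| output vectors on every training step.

```python
import numpy as np

rng = np.random.default_rng(0)
V, dim = 50_000, 100                      # illustrative sizes only
W_out = rng.normal(size=(V, dim))         # one output vector per vocabulary word
v_target = rng.normal(size=dim)           # vector of the current target word

scores = W_out @ v_target                 # |V| dot products -- this is the expensive part
probs = np.exp(scores - scores.max())     # shifted for numerical stability
probs /= probs.sum()                      # normalisation over the whole vocabulary
```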

What can be done?

Different strategies have been proposed to approximate the softmax. These can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches keep the softmax layer intact but modify its architecture to improve its efficiency (e.g. hierarchical softmax). Sampling-based approaches, on the other hand, do away with the softmax layer entirely and instead optimise some other loss function that approximates the softmax: they approximate the normalisation in the denominator of the softmax with some other loss that is cheap to compute, such as negative sampling.

The loss function in Word2vec is something like:

[image: the word2vec loss function]

Its logarithm can be decomposed into:

[image: the loss with the logarithm expanded into a score term and a normalisation term]

With some mathematics and the gradient formulas (see [6] for more details), this is converted to:

[image: the objective rewritten as a binary (sigmoid) classification loss]

As you can see, the objective has been converted into a binary classification task (y=1: positive class, y=0: negative class). Since we need labels to perform this binary classification, we designate all context words c as true labels (y=1, positive samples) and k words randomly selected from the corpus as false labels (y=0, negative samples).
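For reference, since the equation images above may not render, this is the standard skip-gram negative-sampling objective from Mikolov et al. (2013) that the answer is describing; the notation (v for target vectors, v' for context/output vectors, k negatives drawn from a noise distribution P_n) is the usual one and may differ slightly from the images.

```latex
% Binary classification view: probability that (w, c) is a genuine target/context pair
P(y = 1 \mid w, c) = \sigma\!\left({v'_c}^{\top} v_w\right)
                   = \frac{1}{1 + e^{-{v'_c}^{\top} v_w}}

% Objective maximised for one target word w, its true context c,
% and k negative samples w_i drawn from a noise distribution P_n(w):
\log \sigma\!\left({v'_c}^{\top} v_w\right)
  + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}
      \left[ \log \sigma\!\left(-{v'_{w_i}}^{\top} v_w\right) \right]
```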


Look at the following paragraph. Assume our target word is "Word2vec". With a window of 3, our context words are: The, widely, popular, algorithm, was, developed. These context words are treated as positive labels. We also need some negative labels, so we randomly pick some words from the corpus (produce, software, Collobert, margin-based, probabilistic) and treat them as negative samples. This technique of picking random examples from the corpus is called negative sampling. (A small code sketch of this labelling step follows the image below.)

[image: the example paragraph with the target word "Word2vec", its context words, and the sampled negative words highlighted]
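As promised above, here is a toy Python sketch of that labelling step; the sentence fragment and the candidate negative words are taken from the example, while the window extraction and the random draw are illustrative only.

```python
import random

random.seed(0)

# Sentence fragment from the example; the target word sits inside it.
sentence = "The widely popular Word2vec algorithm was developed".split()
target = "Word2vec"
window = 3

pos = sentence.index(target)
# Positive samples (y=1): words within +/- `window` positions of the target.
positives = [w for i, w in enumerate(sentence) if i != pos and abs(i - pos) <= window]

# Candidate pool standing in for "the rest of the corpus" (purely illustrative).
corpus_vocab = ["produce", "software", "Collobert", "margin-based", "probabilistic",
                "vector", "learning"]

# Negative samples (y=0): k words picked at random from the corpus vocabulary.
k = 5
negatives = random.sample(corpus_vocab, k)

print("positive (y=1):", positives)   # The, widely, popular, algorithm, was, developed
print("negative (y=0):", negatives)
```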

References:

  • (1) C. Dyer, "Notes on Noise Contrastive Estimation and Negative Sampling", 2014
  • (2) http://sebastianruder.com/word-embeddings-softmax/
Answered by Amir