
NLP - negative sampling - how to draw negative samples from noise distribution?

Tags:

python

nlp

From my understanding, negative sampling randomly draws K negative samples from a noise distribution P(w). The noise distribution is basically the frequency distribution of words with some modification. Typically we choose K = 5 ~ 20 negative samples.

P(w) = U(w)^(3/4) / normalization_factor

And I've seen the same equation represented in two different notations:

[image: the two notations of the negative-sampling objective, with a blue box and a red box highlighting parts of the equations]

Three questions:

  1. What is the meaning of the blue box? What is the significance of j and i?
  2. The second equation does not seem to show anything that "randomly draws" words from the noise distribution. What is the meaning of the k in the red box?
  3. How do you choose noise samples from the noise distribution?

Let's say that the normalized noise distribution looks as the following dictionary:

dist = {'apple': 0.0023, 'bee': 0.004, 'desk': 0.032, 'chair': 0.032, ...}

How do you "randomly draw" K noise samples from dist?

Asked by Eric Kim

1 Answer

I figured this out and wrote a tutorial article about negative sampling.

  1. The blue box means that u_j is drawn from the noise distribution P_n(w).
  2. The blue box therefore also incorporates the "randomly draws" aspect of negative sampling: u_j is the i-th negative sample drawn from the noise distribution and, at the same time, the j-th word vector in the output weight matrix.
  3. You use something like np.random.choice(); see the sketch at the end of this answer.

The cost function given in the original Word2Vec paper is actually quite confusing in terms of notation. A clearer form of the cost function is:

E = -log σ(c_pos · h) - Σ_{c_neg ∈ W_neg} log σ(-c_neg · h)

where c_pos is the word vector of the positive (context) word, h is the hidden layer, which is equivalent to the word vector of the input word w, c_neg is the word vector of a randomly drawn negative word, W_neg is the set of word vectors of all K negative words, and σ is the sigmoid function.
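
As a rough sketch, this is how that cost could be computed for a single (input word, context word) pair with NumPy. The vector dimension, K = 5 negatives, and the helper names (sigmoid, negative_sampling_loss) are arbitrary choices for illustration, not anything prescribed by the paper:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(h, c_pos, W_neg):
    # E = -log σ(c_pos · h) - Σ log σ(-c_neg · h), summed over the K negative vectors
    positive_term = -np.log(sigmoid(np.dot(c_pos, h)))
    negative_term = -np.sum(np.log(sigmoid(-W_neg @ h)))
    return positive_term + negative_term

dim   = 100                       # embedding size (arbitrary for this sketch)
K     = 5                         # number of negative samples
h     = np.random.randn(dim)      # hidden layer = word vector of the input word
c_pos = np.random.randn(dim)      # word vector of the positive (context) word
W_neg = np.random.randn(K, dim)   # word vectors of the K negative words

print(negative_sampling_loss(h, c_pos, W_neg))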

The noise distribution is the normalized frequency distribution of words, raised to the power of α. Mathematically, it can be expressed as:

P_n(w) = U(w)^α / Z

A distribution of words based on how many times each word appears in a corpus is called the unigram distribution, and is denoted U(w). Z is a normalization factor, and α is a hyperparameter that is typically set to α = 3/4.

Raising the distribution to the power of α has the effect of smoothing it out:

[image: the unigram distribution U(w) compared with the smoothed distribution U(w)^(3/4)]

It combats the imbalance between common words and rare words by decreasing the probability of drawing common words and increasing the probability of drawing rare words.

Negative samples are randomly drawn from the noise distribution:

import numpy as np

# Unigram distribution U(w): each word's relative frequency in the corpus
unig_dist = {'apple': 0.023, 'bee': 0.12, 'desk': 0.34, 'chair': 0.517}
print(sum(unig_dist.values()))
# 1.0

alpha = 3 / 4

# Raise each probability to the power of alpha, then re-normalize by Z
noise_dist = {key: val ** alpha for key, val in unig_dist.items()}
Z = sum(noise_dist.values())
noise_dist_normalized = {key: val / Z for key, val in noise_dist.items()}

print(noise_dist_normalized)
# {'apple': 0.044813853132981724,
#  'bee': 0.15470428538870049,
#  'desk': 0.33785130228003507,
#  'chair': 0.4626305591982827}

Initially, chair was the most common word, with a probability of 0.517 of being drawn. After the unigram distribution U(w) was raised to the power of 3/4, its probability becomes 0.463.

On the other hand, apple was the least common word, with a probability of 0.023, but after the transformation its probability becomes 0.045. The imbalance between the most common word (chair) and the least common word (apple) has been mitigated.
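
Finally, to actually draw the K negative samples, one option is np.random.choice with the normalized noise distribution as the probability vector. This is a minimal sketch continuing the example above; the names words, probs, and negative_samples are just illustrative:

K = 5                                       # number of negative samples per positive pair
words = list(noise_dist_normalized.keys())
probs = list(noise_dist_normalized.values())

# Frequent words are still drawn more often, but less dominantly than under U(w)
negative_samples = np.random.choice(words, size=K, p=probs)
print(negative_samples)
# e.g. ['chair' 'desk' 'bee' 'chair' 'desk']

Note that np.random.choice samples with replacement by default, so the same word can be drawn more than once; whether duplicates or the positive word itself are filtered out varies between implementations.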

Answered by Eric Kim


