From my understanding, negative sampling randomly draws K negative samples from a noise distribution P(w). The noise distribution is basically the word frequency distribution with some modification applied. Typically we choose K = 5 to 20 negative samples.
P(w) = U(w)^(3/4) / normalization_factor
And I've seen the same equation represented in two different notations:

Three questions:
1. What do j and i refer to?
2. What is k in the red box?
3. Let's say that the normalized noise distribution looks like the following dictionary:
dist = {'apple': 0.0023, 'bee': 0.004, 'desk': 0.032, 'chair': 0.032, ...}
How do you "randomly draw" K noise samples from dist?
I figured this out and wrote a tutorial article about negative sampling.
u_j comes from the noise distribution P_n(w). u_j is the i-th negative sample drawn from the noise distribution and, at the same time, the j-th word vector in the output weight matrix.
You can randomly draw the K noise samples with np.random.choice() (see the code at the end of this answer).
The cost function given in the original Word2Vec paper is actually quite confusing in terms of notation. A clearer form of the cost function would be:
J = -log σ(c_pos·h) - Σ_{c_neg ∈ W_neg} log σ(-c_neg·h)
where c_pos is the word vector of the positive word, h is the hidden layer and is equivalent to the word vector of the input word w, c_neg is the word vector of a randomly drawn negative word, and W_neg is the set of word vectors of all K negative words.
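To make the notation concrete, here is a minimal NumPy sketch of that cost function; the embedding size (5), K = 3, and the random vectors are made-up values for illustration only:
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
h = rng.normal(size=5)           # hidden layer = word vector of the input word w
c_pos = rng.normal(size=5)       # word vector of the positive (context) word
W_neg = rng.normal(size=(3, 5))  # word vectors of the K = 3 negative words

# J = -log σ(c_pos·h) - Σ_{c_neg ∈ W_neg} log σ(-c_neg·h)
J = -np.log(sigmoid(c_pos @ h)) - np.sum(np.log(sigmoid(-W_neg @ h)))
The positive term rewards a high dot product between the input word and the positive word, while the negative terms penalize high dot products with the K sampled negative words.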
The noise distribution is the normalized frequency distribution of words raised to the power of α. Mathematically, it can be expressed as:
P_n(w) = U(w)^α / Z
A distribution of words based on how many times each word appears in a corpus is called the unigram distribution, denoted U(w). Z is a normalization factor, and α is a hyper-parameter, typically α = 3/4.
Raising the distribution to the power of α has the effect of smoothing it out:

It attempts to combat the imbalance between common words and rare words by decreasing the probability of drawing common words and increasing the probability of drawing rare words.
Negative samples are randomly drawn from the noise distribution:
import numpy as np

# unigram distribution U(w): word frequencies normalized to sum to 1
unig_dist = {'apple': 0.023, 'bee': 0.12, 'desk': 0.34, 'chair': 0.517}
sum(unig_dist.values())
>>> 1.0

# raise U(w) to the power of α and renormalize: P_n(w) = U(w)^α / Z
alpha = 3 / 4
noise_dist = {key: val ** alpha for key, val in unig_dist.items()}
Z = sum(noise_dist.values())
noise_dist_normalized = {key: val / Z for key, val in noise_dist.items()}
noise_dist_normalized
>>> {'apple': 0.044813853132981724,
     'bee': 0.15470428538870049,
     'desk': 0.33785130228003507,
     'chair': 0.4626305591982827}
Initially, chair was the most common word, with a probability of being drawn of 0.517. After the unigram distribution U(w) is raised to the power of 3/4, its probability drops to 0.463.
On the other hand, apple was the least common word, with a probability of 0.023, but after the transformation its probability rises to 0.045. The imbalance between the most common word (chair) and the least common word (apple) is thus mitigated.
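To answer the last question: the K negative samples can then be drawn from this normalized noise distribution with np.random.choice(). A minimal sketch, continuing from the noise_dist_normalized dictionary above (K = 10 is an arbitrary choice here):
K = 10
words = list(noise_dist_normalized.keys())
probs = list(noise_dist_normalized.values())
negative_samples = np.random.choice(words, size=K, p=probs)
# the draws vary per run, but frequent words such as 'chair' are sampled
# more often than rare ones such as 'apple'
Each sampled word is then mapped to its vector in the output weight matrix, which serves as c_neg in the cost function above.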