PCA on word2vec embeddings

Question

I am trying to reproduce the results of this paper: https://arxiv.org/pdf/1607.06520.pdf

Specifically this part:

To identify the gender subspace, we took the ten gender pair difference vectors and computed its principal components (PCs). As Figure 6 shows, there is a single direction that explains the majority of variance in these vectors. The first eigenvalue is significantly larger than the rest.

[Image: Figure 6 from the paper]

I am using the same set of word vectors as the authors (Google News Corpus, 300 dimensions), which I load into word2vec.

The 'ten gender pair difference vectors' the authors refer to are computed from the following word pairs:

[Image: the ten word pairs, which also appear in the code below]

I've computed the differences between each normalized vector in the following way:

import gensim
import numpy as np

model = gensim.models.KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
model.init_sims()  # precompute the unit-normalized vectors

pairs = [('she', 'he'),
         ('her', 'his'),
         ('woman', 'man'),
         ('Mary', 'John'),
         ('herself', 'himself'),
         ('daughter', 'son'),
         ('mother', 'father'),
         ('gal', 'guy'),
         ('girl', 'boy'),
         ('female', 'male')]

difference_matrix = np.array([model.word_vec(a[0], use_norm=True) - model.word_vec(a[1], use_norm=True) for a in pairs])

I then perform PCA on the resulting matrix, with 10 components, as per the paper:

from sklearn.decomposition import PCA
pca = PCA(n_components=10)
pca.fit(difference_matrix)

However, I get very different results when I look at pca.explained_variance_ratio_:

array([  2.83391436e-01,   2.48616155e-01,   1.90642492e-01,
         9.98411858e-02,   5.61260498e-02,   5.29706681e-02,
         2.75670634e-02,   2.21957722e-02,   1.86491774e-02,
         1.99108478e-32])

or with a chart:

[Image: bar chart of the explained variance ratios above]

The first component accounts for less than 30% of the variance when it should be above 60%!

The results I get are similar to what I get when I try to do the PCA on randomly selected vectors, so I must be doing something wrong, but I can't figure out what.

Note: I've tried without normalizing the vectors, but I get the same results.

oregano · Accepted Answer

The authors released the code for the paper on GitHub: https://github.com/tolga-b/debiaswe

Specifically, you can see their code for creating the PCA plot in this file.

Here is the relevant snippet of code from that file:

def doPCA(pairs, embedding, num_components = 10):
    matrix = []
    for a, b in pairs:
        center = (embedding.v(a) + embedding.v(b))/2
        matrix.append(embedding.v(a) - center)
        matrix.append(embedding.v(b) - center)
    matrix = np.array(matrix)
    pca = PCA(n_components = num_components)
    pca.fit(matrix)
    # bar(range(num_components), pca.explained_variance_ratio_)
    return pca

Based on the code, it looks like they take the difference between each word in a pair and the average vector of that pair, so every pair contributes two rows to the matrix rather than one. To me, it's not clear that this is what they meant in the paper. However, I ran this code with their pairs and was able to recreate the graph from the paper:

[Image: bar chart of explained variance per component, reproduced with the code above]
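
A minimal sketch of the same pair-centering idea, written against a plain word-to-vector lookup rather than the repository's embedding class. The name pair_centered_pca and the vectors parameter are hypothetical; vectors is assumed to be anything that supports vectors[word] and returns a 1-D array, such as a dict or a gensim KeyedVectors object. This is an illustration, not the authors' code.

```python
import numpy as np
from sklearn.decomposition import PCA


def pair_centered_pca(pairs, vectors, num_components=10):
    """Fit PCA on each word's offset from the midpoint of its pair.

    `vectors` is a hypothetical word -> 1-D array lookup (assumption).
    """
    rows = []
    for a, b in pairs:
        va = np.asarray(vectors[a], dtype=float)
        vb = np.asarray(vectors[b], dtype=float)
        center = (va + vb) / 2
        # Each pair contributes two rows pointing in opposite directions.
        rows.append(va - center)
        rows.append(vb - center)
    pca = PCA(n_components=num_components)
    pca.fit(np.array(rows))
    return pca
```

With the setup from the question, this could be called as pair_centered_pca(pairs, model) and inspected through explained_variance_ratio_. The sketch does not normalize the vectors itself, so any normalization has to happen before the lookup.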

jnaf · Answer

To expand on oregano's answer:

For each pair, a and b, they calculate the center, c = (a + b) / 2 and then include vectors pointing in both directions, a - c and b - c.

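This matters because PCA finds the directions along which the data varies most, and that variance is measured around the mean (scikit-learn's PCA centers the data before decomposing it). All of your difference vectors point in roughly the same direction, so that shared direction is absorbed into the mean and there is very little variance left in precisely the direction you are trying to reveal.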

Their set includes vectors pointing in both directions in the gender subspace, so PCA clearly reveals gender variation.
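
The effect can be illustrated on synthetic data, without the word vectors. The sketch below is an assumption-laden toy: it builds ten fake difference vectors that share one made-up direction plus small random noise, then compares PCA on the one-sided differences with PCA on the two-sided, pair-centered rows. The dimensions, noise level and variable names are all invented for the illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
dim, n_pairs = 50, 10

# Hypothetical shared direction standing in for the "gender direction".
shared = np.zeros(dim)
shared[0] = 1.0

# Fake a - b differences: the shared direction plus small random noise.
diffs = shared + 0.05 * rng.normal(size=(n_pairs, dim))

# One-sided: only a - b, as in the question. Centering removes the shared part.
one_sided = PCA(n_components=10).fit(diffs)

# Two-sided: a - c and b - c with c = (a + b) / 2, i.e. +d/2 and -d/2.
two_sided = PCA(n_components=10).fit(np.vstack([diffs / 2, -diffs / 2]))

print("one-sided first ratio:", one_sided.explained_variance_ratio_[0])
print("two-sided first ratio:", two_sided.explained_variance_ratio_[0])
```

In this toy setup the first component dominates only in the two-sided case. The one-sided fit also has a last ratio that is numerically zero, because ten centered samples span at most nine dimensions, which is consistent with the final value in the question's output.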
