 

Visualise word2vec generated from gensim


I have trained a doc2vec model and corresponding word2vec vectors on my own corpus using gensim. I want to visualise the word2vec vectors with t-SNE, with the words shown. That is, each dot in the figure should have its word displayed next to it.

I looked at a similar question here: t-sne on word2vec

Following it, I have this code:

import gensim.models as g
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

modelPath="/Users/tarun/Desktop/PE/doc2vec/model3_100_newCorpus60_1min_6window_100trainEpoch.bin"
model = g.Doc2Vec.load(modelPath)

X = model[model.wv.vocab]
print(len(X))
print(X[0])
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X[:1000, :])

plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()

This gives a figure with dots but no words. That is, I don't know which dot represents which word. How can I display the word with the dot?

asked May 04 '17 by Dreams


1 Answer

Two parts to the answer: how to get the word labels, and how to plot the labels on a scatterplot.

Word labels in gensim's word2vec

In older gensim releases, model.wv.vocab was a dict mapping each word to a vocabulary object; in gensim 4+ the equivalent lookup is model.wv.key_to_index, which maps each word to its integer index. To load the data into X for t-SNE, I made one change.

vocab = list(model.wv.key_to_index)
X = model.wv[vocab]

This accomplishes two things: (1) it gets you a standalone vocab list for the final dataframe to plot, and (2) when you index model.wv with that list, you know exactly which word each row of X belongs to.
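As a quick sanity check (my own addition; it assumes gensim's batch lookup returns rows in the order of the keys you pass in, which is its documented behavior):

import numpy as np

# Row i of X should be the vector for vocab[i].
assert np.allclose(X[0], model.wv[vocab[0]])
assert X.shape[0] == len(vocab)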

Proceed as before with

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
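Note that t-SNE is stochastic, so the layout will differ between runs. If you want reproducible output, sklearn's TSNE accepts a random_state parameter (my addition, not in the original answer):

tsne = TSNE(n_components=2, random_state=0)
X_tsne = tsne.fit_transform(X)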

Now let's put X_tsne together with the vocab list. This is easy with pandas, so import pandas as pd if you don't have that yet.

df = pd.DataFrame(X_tsne, index=vocab, columns=['x', 'y'])

The vocab words are the indices of the dataframe now.

I don't have your dataset, but as in the other SO question you mentioned, an example df built from sklearn's newsgroups data would look something like

                        x             y
politics    -1.524653e+20 -1.113538e+20
worry        2.065890e+19  1.403432e+20
mu          -1.333273e+21 -5.648459e+20
format      -4.780181e+19  2.397271e+19
recommended  8.694375e+20  1.358602e+21
arguing     -4.903531e+19  4.734511e+20
or          -3.658189e+19 -1.088200e+20
above        1.126082e+19 -4.933230e+19

Scatterplot

I like the object-oriented approach to matplotlib, so this starts out a little differently.

fig = plt.figure()
ax = fig.add_subplot(1, 1, 1)

ax.scatter(df['x'], df['y'])

Lastly, the annotate method will label the coordinates. Its first two arguments are the text label and a 2-tuple of (x, y) coordinates. Using iterrows(), this can be very succinct:

for word, pos in df.iterrows():
    ax.annotate(word, pos)

[Thanks to Ricardo in the comments for this suggestion.]
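If the labels land right on top of the dots, one variation (standard matplotlib annotate arguments, not from the original answer) is to nudge each label a few points away from its point:

for word, pos in df.iterrows():
    # Offset the label 4 points right and 2 points up from the dot.
    ax.annotate(word, pos, xytext=(4, 2), textcoords='offset points', fontsize=8)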

Then do plt.show() or fig.savefig(). Depending on your data, you'll probably have to mess with ax.set_xlim and ax.set_ylim to see into a dense cloud. This is the newsgroup example without any tweaking:

[figure: t-SNE scatterplot of the newsgroup vocabulary, without tweaking]
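For example, to zoom into a dense cluster (the limits below are made-up placeholders; read suitable values off your own plot):

# Hypothetical limits -- adjust to frame the region you care about.
ax.set_xlim(-25, 25)
ax.set_ylim(-25, 25)
fig.savefig('tsne_zoomed.png')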

You can modify dot size, color, etc., too.
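These are ordinary matplotlib scatter keyword arguments; a quick sketch (the values are just illustrative):

# Smaller, semi-transparent dots make dense clouds easier to read.
ax.scatter(df['x'], df['y'], s=15, c='steelblue', alpha=0.6, edgecolors='none')

Happy fine-tuning!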

answered Sep 21 '22 by Niels Joaquin