
How to run tsne on word2vec created from gensim?

I want to visualize a word2vec model created with the gensim library. I tried sklearn, but it seems I need to install a developer version to get it. I tried installing the developer version, but that does not work on my machine. Is it possible to modify this code to visualize a word2vec model?

tsne_python

Shakti asked Nov 14 '16 02:11




1 Answer

You don't need a developer version of scikit-learn - just install scikit-learn the usual way via pip or conda.

To access the word vectors created by word2vec, simply use the model's vocabulary dictionary as an index into the word vectors:

X = model.wv[model.wv.vocab]

The following is a simple but complete code example that loads some newsgroup data, applies very basic data preparation (cleaning and breaking up sentences), trains a word2vec model, reduces the dimensions with t-SNE, and visualizes the output.

from gensim.models.word2vec import Word2Vec
from sklearn.manifold import TSNE
from sklearn.datasets import fetch_20newsgroups
import re
import matplotlib.pyplot as plt

# download example data (may take a while)
train = fetch_20newsgroups()

def clean(text):
    """Remove posting header, split by sentences and words, keep only letters"""
    lines = re.split(r'[?!.:]\s', re.sub(r'^.*Lines: \d+', '', re.sub('\n', ' ', text)))
    return [re.sub('[^a-zA-Z]', ' ', line).lower().split() for line in lines]

sentences = [line for text in train.data for line in clean(text)]

# gensim < 4.0 API; gensim >= 4.0 renames size to vector_size
model = Word2Vec(sentences, workers=4, size=100, min_count=50, window=10, sample=1e-3)

print(model.wv.most_similar('memory'))

# gensim < 4.0 API; gensim >= 4.0 replaces wv.vocab with wv.index_to_key
X = model.wv[model.wv.vocab]

tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()
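The scatter plot above shows only unlabeled dots. If you also want to see which word each point is, one option is to annotate every point. This is only a sketch: `plot_labeled`, the output file name, and the random stand-in points are hypothetical and not part of the example above, and the `Agg` backend is used so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, writes to a file instead of a window
import matplotlib.pyplot as plt
import numpy as np

def plot_labeled(points, words, out_path="tsne_words.png"):
    """Scatter 2-D points and write each word next to its point."""
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(points[:, 0], points[:, 1], s=10)
    for (x, y), word in zip(points, words):
        ax.annotate(word, (x, y), xytext=(3, 3),
                    textcoords="offset points", fontsize=8)
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path

# Stand-in data: random 2-D points in place of real t-SNE output.
rng = np.random.default_rng(0)
demo_words = ["memory", "disk", "ram", "cpu", "drive"]
demo_points = rng.normal(size=(len(demo_words), 2))
saved = plot_labeled(demo_points, demo_words)
```

With the variables from the example above (gensim < 4.0), the call would be `plot_labeled(X_tsne, list(model.wv.vocab))`. With a vocabulary of several thousand words the labels will overlap, so you may want to plot only a subset.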
goerlitz answered Sep 19 '22 13:09