I want to visualize a word2vec created from gensim library. I tried sklearn but it seems I need to install a developer version to get it. I tried installing the developer version but that is not working on my machine . Is it possible to modify this code to visualize a word2vec model ?
tsne_python
To assess which word2vec model is best, simply calculate the distance for each pair, do it 200 times, sum up the total distance, and the smallest total distance will be your best model.
The pre-trained Google word2vec model was trained on Google news data (about 100 billion words); it contains 3 million words and phrases and was fit using 300-dimensional word vectors. It is a 1.53 Gigabytes file. You can download it from here: GoogleNews-vectors-negative300.
Introduces Gensim's Word2Vec model and demonstrates its use on the Lee Evaluation Corpus. In case you missed the buzz, Word2Vec is a widely used algorithm based on neural networks, commonly referred to as “deep learning” (though word2vec itself is rather shallow).
The full model can be stored/loaded via its save() and load() methods. The trained word vectors can also be stored/loaded from a format compatible with the original word2vec implementation via self.
You don't need a developer version of scikit-learn - just install scikit-learn the usual way via pip or conda.
To access the word vectors created by word2vec simply use the word dictionary as index into the model:
X = model[model.wv.vocab]
Following is a simple but complete code example which loads some newsgroup data, applies very basic data preparation (cleaning and breaking up sentences), trains a word2vec model, reduces the dimensions with t-SNE, and visualizes the output.
from gensim.models.word2vec import Word2Vec
from sklearn.manifold import TSNE
from sklearn.datasets import fetch_20newsgroups
import re
import matplotlib.pyplot as plt
# download example data ( may take a while)
train = fetch_20newsgroups()
def clean(text):
"""Remove posting header, split by sentences and words, keep only letters"""
lines = re.split('[?!.:]\s', re.sub('^.*Lines: \d+', '', re.sub('\n', ' ', text)))
return [re.sub('[^a-zA-Z]', ' ', line).lower().split() for line in lines]
sentences = [line for text in train.data for line in clean(text)]
model = Word2Vec(sentences, workers=4, size=100, min_count=50, window=10, sample=1e-3)
print (model.wv.most_similar('memory'))
X = model.wv[model.wv.vocab]
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)
plt.scatter(X_tsne[:, 0], X_tsne[:, 1])
plt.show()
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With