I want to analyse some text on a Google Compute server on Google Cloud Platform (GCP) using the Word2Vec model.
However, the un-compressed word2vec model from https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ is over 3.5GB and it will take time to download it manually and upload it to a cloud instance.
Is there any way to access this (or any other) pre-trained Word2Vec model on a Google Compute server without uploading it myself?
As an alternative to manually downloading the file, you can use a pre-packaged version (third-party, not from Google) hosted as a Kaggle dataset.
First, sign up for Kaggle and get your API credentials: https://github.com/Kaggle/kaggle-api#api-credentials
Then, do this on the command line:
pip3 install kaggle
mkdir -p $HOME/.kaggle/
echo '{"username":"****","key":"****"}' > $HOME/.kaggle/kaggle.json
chmod 600 $HOME/.kaggle/kaggle.json
kaggle datasets download -p $HOME/content alvations/vegetables-google-word2vec
unzip $HOME/content/vegetables-google-word2vec.zip -d $HOME/content
Finally, in Python:
import os
import numpy as np

home = os.environ["HOME"]
# Embedding matrix: one 300-d row per vocabulary token.
embeddings = np.load(os.path.join(home, 'content/word2vec.news.negative-sample.300d.npy'))
# The accompanying .txt file lists the tokens, one per line, in row order.
with open(os.path.join(home, 'content/word2vec.news.negative-sample.300d.txt')) as fp:
    tokens = [line.strip() for line in fp]
# Look up a word's vector by its row index.
embeddings[tokens.index('hello')]
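Once you have the `embeddings` matrix and `tokens` list, nearest-neighbour queries are a couple of lines of NumPy. A minimal sketch with a toy four-word vocabulary standing in for the real files (the tokens and vectors here are made up for illustration):

```python
import numpy as np

# Toy stand-ins for the real tokens/embeddings loaded above.
tokens = ["hello", "world", "potato", "carrot"]
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.2],
])

def most_similar(word, topn=2):
    """Return the topn nearest tokens to `word` by cosine similarity."""
    q = embeddings[tokens.index(word)]
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(q)
    sims = embeddings @ q / norms
    order = np.argsort(-sims)
    return [(tokens[i], float(sims[i])) for i in order if tokens[i] != word][:topn]

print(most_similar("hello"))  # "world" ranks first: its vector is nearly parallel
```

For the real 3M-word Google News matrix, consider `np.load(..., mmap_mode='r')` so the 3.5GB file is memory-mapped instead of read into RAM at once.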
Full example on Colab: https://colab.research.google.com/drive/178WunB1413VE2SHe5d5gc0pqAd5v6Cpl
P.S. To access other pre-packaged word embeddings, see https://github.com/alvations/vegetables
You can also use Gensim to download pre-trained models through its downloader API:
import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)
or from the command line:
python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)
For a list of available datasets, check: https://github.com/RaRe-Technologies/gensim-data