 

How to access/use Google's pre-trained Word2Vec model without manually downloading the model?

I want to analyse some text on a Google Compute server on Google Cloud Platform (GCP) using the Word2Vec model.

However, the uncompressed Word2Vec model from https://mccormickml.com/2016/04/12/googles-pretrained-word2vec-model-in-python/ is over 3.5 GB, so downloading it manually and then uploading it to a cloud instance takes a long time.

Is there any way to access this (or any other) pre-trained Word2Vec model on a Google Compute server without uploading it myself?

Scott Vinay asked Sep 18 '19 03:09


2 Answers

As an alternative to downloading the model manually, you can use a pre-packaged version (third-party, not from Google) hosted as a Kaggle dataset.

First, sign up for Kaggle and get your API credentials: https://github.com/Kaggle/kaggle-api#api-credentials

Then, do this on the command line:

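pip3 install kaggle
# Keep the credentials and the download under $HOME so every path below agrees.
mkdir -p "$HOME/.kaggle" "$HOME/content"
echo '{"username":"****","key":"****"}' > "$HOME/.kaggle/kaggle.json"
chmod 600 "$HOME/.kaggle/kaggle.json"
kaggle datasets download alvations/vegetables-google-word2vec -p "$HOME/content"
unzip "$HOME/content/vegetables-google-word2vec.zip" -d "$HOME/content"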

Finally, in Python:

import os

import numpy as np

home = os.environ["HOME"]
# Embedding matrix: one row per token.
embeddings = np.load(os.path.join(home, 'content/word2vec.news.negative-sample.300d.npy'))
# Vocabulary file: line i holds the token for row i of the matrix.
with open(os.path.join(home, 'content/word2vec.news.negative-sample.300d.txt')) as fp:
    tokens = [line.strip() for line in fp]
# Look up the vector for a word by its position in the vocabulary.
embeddings[tokens.index('hello')]
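
Note that tokens.index() scans the whole vocabulary list on every call, which can get slow with a large vocabulary. If you need many lookups, one option is to build a token-to-row dictionary once. The sketch below is only an illustration: build_index and lookup are hypothetical helper names, and the toy lists stand in for the real tokens and embeddings loaded above (row indexing works the same way on a NumPy array).

```python
# Hypothetical helpers; the toy data below stands in for the real files.

def build_index(tokens):
    """Map each token to its row number (first occurrence wins, like list.index)."""
    index = {}
    for row, token in enumerate(tokens):
        index.setdefault(token, row)
    return index


def lookup(embeddings, index, word):
    """Return the row for `word`, or None if the word is out of vocabulary."""
    row = index.get(word)
    return None if row is None else embeddings[row]


# Toy stand-ins: three tokens with 2-dimensional vectors.
toy_tokens = ["hello", "world", "hello_world"]
toy_embeddings = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

toy_index = build_index(toy_tokens)
print(lookup(toy_embeddings, toy_index, "world"))    # [0.3, 0.4]
print(lookup(toy_embeddings, toy_index, "missing"))  # None
```

With the real data you would call build_index(tokens) once and then lookup(embeddings, index, 'hello') as often as needed.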

Full example on Colab: https://colab.research.google.com/drive/178WunB1413VE2SHe5d5gc0pqAd5v6Cpl


P.S. To access other pre-packaged word embeddings, see https://github.com/alvations/vegetables

alvas answered Sep 20 '22 01:09


You can also use Gensim to download the model through its downloader API:

import gensim.downloader as api
path = api.load("word2vec-google-news-300", return_path=True)
print(path)

Or, from the command line:

python -m gensim.downloader --download <dataname> # same as api.load(dataname, return_path=True)

For a list of available datasets, see https://github.com/RaRe-Technologies/gensim-data
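
If you want the vectors in memory rather than just the file path, calling api.load without return_path=True returns the loaded model. The following is a rough sketch, not a definitive recipe: load_word2vec and nearest_words are hypothetical helper names, and the import sits inside the function only so the helpers can be defined on a machine where gensim is not installed yet.

```python
def load_word2vec(name="word2vec-google-news-300"):
    """Download (on first use) and load a gensim-data model as KeyedVectors."""
    # Imported lazily so that defining this helper does not require gensim.
    import gensim.downloader as api
    return api.load(name)


def nearest_words(vectors, word, topn=5):
    """Return the `topn` most similar words, or [] if `word` is not in the vocabulary."""
    if word not in vectors:
        return []
    return [w for w, _score in vectors.most_similar(word, topn=topn)]
```

On the server you would run vectors = load_word2vec() once and then, for example, nearest_words(vectors, "hello"). Expect the first call to be slow, since the model file is large and has to be downloaded before it can be loaded.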

Fra answered Sep 21 '22 01:09