
Working with Google word2vec .bin files in gensim (Python)

I'm trying to get started by loading the pretrained .bin files from the Google word2vec site (freebase-vectors-skipgram1000.bin.gz) into the gensim implementation of word2vec. The model loads fine using:

from gensim.models import word2vec
model = word2vec.Word2Vec.load_word2vec_format('...../free....-en.bin', binary=True)

and creates a Word2Vec object:

>>> print model
<gensim.models.word2vec.Word2Vec object at 0x105d87f50>

But when I run the most_similar function, it can't find the words in the vocabulary. The error output is below.

Any ideas where I’m going wrong?

>>> model.most_similar(['girl', 'father'], ['boy'], topn=3)
2013-10-11 10:22:00,562 : WARNING : word 'girl' not in vocabulary; ignoring it
2013-10-11 10:22:00,562 : WARNING : word 'father' not in vocabulary; ignoring it
2013-10-11 10:22:00,563 : WARNING : word 'boy' not in vocabulary; ignoring it
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/....../anaconda/python.app/Contents/lib/python2.7/site-packages/gensim-0.8.7/py2.7.egg/gensim/models/word2vec.py", line 312, in most_similar
    raise ValueError("cannot compute similarity with no input")
ValueError: cannot compute similarity with no input
user2870492 asked Oct 11 '13 09:10



2 Answers

The words in '...../free....-en.bin' have the form:

en/boardwalk_chapel en/mutsu_munemitsu en/goffstown en/yaw_axis en/john_e_fogarty_international_center en/francielle_manoel_alberto en/shinji_harada

So when you look up 'girl', it is not in the vocabulary.
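A quick way to confirm this is to look at a few vocabulary keys before querying. The following is only a minimal sketch: `sample_keys` and `missing_words` are hypothetical helper names, and it assumes the loaded model exposes its vocabulary as a dict-like object (in gensim 0.8.x this appears to be `model.vocab`, but check your version). A toy dictionary stands in for the real vocabulary here.

```python
from itertools import islice


def sample_keys(vocab, n=5):
    """Return up to n keys from a dict-like vocabulary."""
    return list(islice(vocab, n))


def missing_words(words, vocab):
    """Return the words that are not keys of the vocabulary."""
    return [w for w in words if w not in vocab]


# Toy stand-in for the model's vocabulary; keys copied from the sample above.
toy_vocab = {"en/boardwalk_chapel": 0, "en/goffstown": 1, "en/yaw_axis": 2}

print(sample_keys(toy_vocab, 2))
print(missing_words(["girl", "en/goffstown"], toy_vocab))
```

With a real model, passing its vocabulary instead of `toy_vocab` should show right away that the keys carry a prefix and that plain words such as 'girl' are absent.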

Sergio answered Sep 30 '22 16:09



To expand a bit on Sergio's answer: the "words" are actually Freebase identifiers, so "girl" is represented either by /en/girl (in freebase-vectors-skipgram1000-en.bin.gz) or by its MID equivalent /m/05r655 (in freebase-vectors-skipgram1000.bin.gz).

https://www.freebase.com/m/05r655

https://www.freebase.com/en/girl
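For the -en file, the original query could be adapted by turning each plain word into its /en/ identifier first. This is a rough sketch, not gensim's own API: `freebase_key` and `resolve` are hypothetical helpers, the lowercase-and-underscore normalization is an assumption based on the sample keys shown earlier, and a toy set stands in for the model's vocabulary.

```python
def freebase_key(word, prefix="/en/"):
    """Build a Freebase-style key such as '/en/girl' from a plain word."""
    return prefix + word.strip().lower().replace(" ", "_")


def resolve(words, vocab, prefix="/en/"):
    """Split words into (keys found in vocab, words with no matching key)."""
    found, not_found = [], []
    for w in words:
        key = freebase_key(w, prefix)
        if key in vocab:
            found.append(key)
        else:
            not_found.append(w)
    return found, not_found


# Toy stand-in for the model's vocabulary (any container supporting `in`).
toy_vocab = {"/en/girl", "/en/father", "/en/boy"}

positive, missing = resolve(["girl", "father", "queen"], toy_vocab)
negative, _ = resolve(["boy"], toy_vocab)

# With a real model, the resolved keys would then be passed on, e.g.:
# model.most_similar(positive=positive, negative=negative, topn=3)
```

If the keys in your file turn out to lack the leading slash, passing `prefix="en/"` covers that case. The MID file would need a lookup from words to /m/ identifiers instead, which cannot be derived from the word itself.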

Tom Morris answered Sep 30 '22 17:09
