Convert word2vec bin file to text

From the word2vec site I can download GoogleNews-vectors-negative300.bin.gz. The .bin file (about 3.4GB) is a binary format not useful to me. Tomas Mikolov assures us that "It should be fairly straightforward to convert the binary format to text format (though that will take more disk space). Check the code in the distance tool, it's rather trivial to read the binary file." Unfortunately, I don't know enough C to understand http://word2vec.googlecode.com/svn/trunk/distance.c.

Supposedly gensim can do this also, but all the tutorials I've found seem to be about converting from text, not the other way.

Can someone suggest modifications to the C code or instructions for gensim to emit text?

Glenn asked Dec 05 '14 20:12



1 Answer

I use this code to load the binary model and then save it as a text file:

    from gensim.models.keyedvectors import KeyedVectors

    model = KeyedVectors.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
    model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)

References: API and nullege.
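To check that the saved file came out as expected, it helps to know what the text format looks like: a header line with the vocabulary size and the vector dimension, then one line per word with the word followed by its values, all separated by spaces. Here is a minimal sketch of a reader for that layout. The function name `read_text_vectors` and the tiny in-memory sample are my own for illustration, not part of gensim.

```python
import io


def read_text_vectors(stream, limit=None):
    """Parse word2vec text format from a text stream.

    Expected layout: a header line 'vocab_size dim', then one line per
    word of the form 'word v1 v2 ... vN'.
    """
    header = stream.readline().split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for i, line in enumerate(stream):
        if limit is not None and i >= limit:
            break  # stop early, useful for peeking at a huge file
        parts = line.rstrip("\n").split(" ")
        word = parts[0]
        values = [float(x) for x in parts[1:] if x]
        if len(values) != dim:
            raise ValueError(f"expected {dim} values for {word!r}, got {len(values)}")
        vectors[word] = values
    return vocab_size, dim, vectors


# Hypothetical two-word sample standing in for the real (multi-GB) file.
sample = io.StringIO("2 3\nhello 0.1 0.2 0.3\nworld -1.0 0.5 0.25\n")
vocab_size, dim, vectors = read_text_vectors(sample)
```

For the real file, you would pass an open text file handle instead of the `StringIO` sample, probably with a small `limit` so that only the first few lines are parsed.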

Note:

The code above is for newer versions of gensim. For older versions, I used this code:

    from gensim.models import word2vec

    model = word2vec.Word2Vec.load_word2vec_format('path/to/GoogleNews-vectors-negative300.bin', binary=True)
    model.save_word2vec_format('path/to/GoogleNews-vectors-negative300.txt', binary=False)
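If you would rather not depend on gensim at all, the binary layout is simple enough to read by hand, which is roughly what distance.c does. Below is a rough sketch using only the standard library. It assumes the usual layout: a text header `vocab_size dim`, then for each word the word bytes, a space, and `dim` 32-bit floats, usually followed by a newline. It also assumes the floats are little-endian, which holds for files written on x86 machines. The function name `convert_bin_to_text` and the in-memory demo data are made up for illustration.

```python
import io
import struct


def convert_bin_to_text(src, dst):
    """Copy word2vec vectors from a binary stream to a text stream."""
    header = src.readline().decode("utf-8")
    vocab_size, dim = (int(x) for x in header.split())
    dst.write(f"{vocab_size} {dim}\n")
    for _ in range(vocab_size):
        # Read the word one byte at a time, up to the separating space.
        chars = bytearray()
        while True:
            ch = src.read(1)
            if not ch:
                raise EOFError("file ended in the middle of a word")
            if ch == b" ":
                break
            if ch != b"\n":  # skip the newline left after the previous vector
                chars.extend(ch)
        word = chars.decode("utf-8", errors="replace")
        # Assumption: little-endian float32 values, 4 bytes each.
        raw = src.read(4 * dim)
        if len(raw) != 4 * dim:
            raise EOFError("file ended in the middle of a vector")
        values = struct.unpack(f"<{dim}f", raw)
        dst.write(word + " " + " ".join(f"{v:.6f}" for v in values) + "\n")


# Tiny hypothetical binary file built in memory to exercise the converter.
demo = io.BytesIO()
demo.write(b"2 3\n")
demo.write(b"hello " + struct.pack("<3f", 0.5, -1.0, 2.0) + b"\n")
demo.write(b"world " + struct.pack("<3f", 1.5, 0.25, -4.0) + b"\n")
demo.seek(0)

out = io.StringIO()
convert_bin_to_text(demo, out)
text_output = out.getvalue()
```

For the real model you would open the .bin file in binary mode (`"rb"`) and the output file in text mode with UTF-8 encoding, then pass both handles to the function. This streams one word at a time, so it does not need to hold the whole model in memory, but the per-byte word reading makes it slower than gensim.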
silo answered Sep 20 '22 01:09