I am trying to do the following kaggle assignmnet. I am using gensim package to use word2vec. I am able to create the model and store it to disk. But when I am trying to load the file back I am getting the error below.
-HP-dx2280-MT-GR541AV:~$ python prog_w2v.py
Traceback (most recent call last):
File "prog_w2v.py", line 7, in <module>
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
header = utils.to_unicode(fin.readline())
File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
return unicode(text, encoding, errors=errors)
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
I find similar question. But I was unable to solve the problem. My prog_w2v.py is as below.
import gensim
import time
start = time.time()
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
end = time.time()
print end-start," seconds"
I am trying to generate the model using code here. The program takes about half an hour to generate the model. Hence I am unable to run it many times to debug it.
The Python "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte" occurs when we specify an incorrect encoding when decoding a bytes object. To solve the error, specify the correct encoding, e.g. utf-16 or open the file in binary mode ( rb or wb ).
decode() is used to decode bytes to a string object. Decoding to a string object depends on the specified arguments. It also allows us to mention an error handling scheme to use for seconding errors. Note: bytes is a built-in binary sequence type in Python.
If you save your model with:
model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')
Then load word2vec with load_word2vec_format
method would cause the issue. To make it work you should use:
wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')
The same thing also happen when you save model with:
model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
And then, want to load with KeyedVectors.load
method. In this situation, use:
wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.bin', binary=False)
If you saved your model with save(), you must use load()
load_word2vec_format is for the model generated by google, not for the model generated by gensim
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With