
Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I am trying to do the following Kaggle assignment. I am using the gensim package to work with word2vec. I am able to create the model and store it to disk, but when I try to load the file back I get the error below.

    -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py 
Traceback (most recent call last):
  File "prog_w2v.py", line 7, in <module>
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
    header = utils.to_unicode(fin.readline())
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I found a similar question, but I was unable to solve the problem. My prog_w2v.py is as below.

import gensim
import time
start = time.time()    
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True) 
end = time.time()   
print end-start,"   seconds"

I am generating the model using the code here. The program takes about half an hour to generate the model, so I cannot run it many times to debug it.

user168983 asked Dec 26 '14 17:12




2 Answers

If you save your model with:

model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')

then loading it with the load_word2vec_format method will cause this error. To make it work, use:

from gensim.models import KeyedVectors

wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')

The same thing also happens in reverse, when you save the model with:

model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)

and then try to load it with the KeyedVectors.load method. In this situation, use:

wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
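
If you are not sure which method wrote a file, and regenerating the model is expensive, you can peek at the first bytes before choosing a loader. The sketch below uses only the standard library and is written for Python 3. The helper name guess_vector_file_format and the sample file names are hypothetical, not part of gensim. It relies on two facts: gensim's save() pickles the object, and pickle streams written with protocol 2 or later begin with byte 0x80, while word2vec-format files (text or binary) begin with a short text header of two integers. Treat the result as a hint, not a guarantee.

```python
import os
import pickle
import tempfile


def guess_vector_file_format(path):
    """Guess how a vectors file was written (hypothetical helper, a heuristic only)."""
    with open(path, "rb") as f:
        first = f.readline(64)  # a word2vec header is a short line: "<vocab_size> <vector_size>"
    if first.startswith(b"\x80"):
        # Pickle protocol 2+ streams start with the PROTO opcode, byte 0x80.
        return "pickle"
    try:
        fields = first.decode("utf-8").split()
        if len(fields) == 2:
            int(fields[0])
            int(fields[1])
            return "word2vec"
    except (UnicodeDecodeError, ValueError):
        pass
    return "unknown"


# Stand-in files for the demo; a real model file would be much larger.
tmp_dir = tempfile.mkdtemp()

pickle_path = os.path.join(tmp_dir, "model.bin")
with open(pickle_path, "wb") as f:
    pickle.dump({"dog": [0.1, 0.2]}, f, protocol=2)

text_path = os.path.join(tmp_dir, "vectors.txt")
with open(text_path, "wb") as f:
    f.write(b"1 2\ndog 0.1 0.2\n")

other_path = os.path.join(tmp_dir, "notes.txt")
with open(other_path, "wb") as f:
    f.write(b"just some notes here\n")

print(guess_vector_file_format(pickle_path))  # pickle   -> try load()
print(guess_vector_file_format(text_path))    # word2vec -> try load_word2vec_format()
print(guess_vector_file_format(other_path))   # unknown
```

A "pickle" result suggests the file came from save() and should go through load(); a "word2vec" result suggests load_word2vec_format(), with binary set to match how the file was written.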
Amir answered Oct 18 '22 22:10



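If you saved your model with save(), you must load it with load().

load_word2vec_format is for models stored in the format of the original word2vec C tool (such as the pretrained vectors published by Google), not for models saved with gensim's own save().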
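
The "byte 0x80 in position 0" in the error message fits this explanation. gensim's save() pickles the object, and a pickle stream written with protocol 2 or later starts with byte 0x80, which is not a valid first byte in UTF-8. The following minimal sketch reproduces the same decoding failure with the standard library alone. It is written for Python 3, and the small dictionary is only a stand-in for a real model.

```python
import pickle

# Stand-in for a saved model: any object pickled with protocol 2 or later.
payload = pickle.dumps({"dog": [0.1, 0.2]}, protocol=2)

first_byte = payload[0]  # indexing bytes yields an int in Python 3
print(hex(first_byte))   # 0x80

# Decoding the pickle as text is what load_word2vec_format effectively
# attempts when it reads the first line as a header.
try:
    payload.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    error_message = str(exc)

print(error_message)
```

The printed message should report an invalid start byte at position 0, as in the traceback from the question.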

Mostafa answered Oct 18 '22 20:10
