Error: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I am working on the following Kaggle assignment and using the gensim package for word2vec. I can create the model and save it to disk, but when I try to load the file back I get the error below.

    -HP-dx2280-MT-GR541AV:~$ python prog_w2v.py 
Traceback (most recent call last):
  File "prog_w2v.py", line 7, in <module>
    models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True)
  File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 579, in load_word2vec_format
    header = utils.to_unicode(fin.readline())
  File "/usr/local/lib/python2.7/dist-packages/gensim/utils.py", line 190, in any2unicode
    return unicode(text, encoding, errors=errors)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte

I found a similar question, but it did not help me solve the problem. My prog_w2v.py is below.

import gensim
import time
start = time.time()    
models = gensim.models.Word2Vec.load_word2vec_format('300features_40minwords_10context.txt', binary=True) 
end = time.time()   
print end-start,"   seconds"

I generate the model using the code here. The program takes about half an hour to build the model, so I cannot rerun it many times to debug it.

asked Dec 26 '14 17:12

user168983

2 Answers

If you save your model with:

model.wv.save(OUTPUT_FILE_PATH + 'word2vec.bin')

Then loading the file with the load_word2vec_format method causes this error. To make it work, use:

wiki_model = KeyedVectors.load(OUTPUT_FILE_PATH + 'word2vec.bin')

The same mismatch happens in the other direction when you save the model with:

model.wv.save_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)

and then try to load it with the KeyedVectors.load method. In that case, use:

wiki_model = KeyedVectors.load_word2vec_format(OUTPUT_FILE_PATH + 'word2vec.txt', binary=False)
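
If you are not sure which method wrote a file, and regenerating the model is slow, you can peek at the first line before choosing a loader. The following is only a rough sketch in Python 3 using the standard library. The helper name guess_saved_format is made up for illustration and is not part of gensim. It assumes that save() writes a pickle with protocol 2 or higher (such files start with byte 0x80) and that word2vec-format files start with a "<vocab_size> <vector_size>" header line.

```python
import os
import pickle
import tempfile


def guess_saved_format(path):
    """Hypothetical helper (not a gensim API): guess how a model file was written.

    Returns "pickle" if the file starts with byte 0x80 (pickle protocol >= 2,
    assumed here to be what save() produces), "word2vec" if the first line
    looks like the "<vocab_size> <vector_size>" header, and "unknown" otherwise.
    """
    with open(path, "rb") as fin:
        first_line = fin.readline()
    if first_line.startswith(b"\x80"):
        return "pickle"
    parts = first_line.split()
    if len(parts) == 2 and all(p.isdigit() for p in parts):
        return "word2vec"
    return "unknown"


# Small demonstration with stand-in files instead of real models.
tmp_dir = tempfile.mkdtemp()

pickled_path = os.path.join(tmp_dir, "model_saved.bin")
with open(pickled_path, "wb") as fout:
    pickle.dump({"hello": [0.1, 0.2, 0.3]}, fout, protocol=2)

text_path = os.path.join(tmp_dir, "model_w2v.txt")
with open(text_path, "wb") as fout:
    fout.write(b"2 3\nhello 0.1 0.2 0.3\nworld 0.4 0.5 0.6\n")

print(guess_saved_format(pickled_path))  # pickle
print(guess_saved_format(text_path))     # word2vec
```

A result of "pickle" suggests the file should be loaded with load(), while "word2vec" suggests load_word2vec_format. The check cannot tell whether a word2vec-format file is text or binary, so the binary flag still has to match how the file was saved.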

answered Oct 18 '22 22:10

Amir

If you saved your model with save(), you must load it with load().

load_word2vec_format is for files in the original word2vec format, such as the vectors published by Google, not for models that gensim wrote with save().
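
The byte 0x80 at position 0 in the error message fits this explanation. A file written with save() is a pickle, and pickles with protocol 2 or higher start with the byte 0x80, which is not a valid first byte in UTF-8. load_word2vec_format tries to decode the first line as a text header and fails right there. The snippet below is a small Python 3 reproduction using only the standard library. The dictionary is just a stand-in for a model, and the protocol number is an assumption about what save() uses.

```python
import pickle

# Stand-in for a saved model (assumption: pickle protocol 2 or higher).
payload = pickle.dumps({"hello": [0.1, 0.2, 0.3]}, protocol=2)

# The pickle stream begins with the protocol marker byte 0x80.
first_byte = payload[:1]
print(first_byte)  # b'\x80'

# Decoding the pickle as UTF-8 text fails on that very first byte,
# which is what happens when the header line is read as text.
message = None
try:
    payload.decode("utf8")
except UnicodeDecodeError as err:
    message = str(err)
    print(message)
```

The printed message names the same byte and position as the traceback in the question.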

answered Oct 18 '22 20:10

Mostafa

Tags: python, character-encoding, gensim, word2vec, kaggle
