'utf-8' codec can't decode byte 0x80

I'm trying to download a BVLC-trained model and I'm stuck on this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 110: invalid start byte

I think it's caused by the following function (complete code):

  # Closure-d function for checking SHA1.
  def model_checks_out(filename=model_filename, sha1=frontmatter['sha1']):
      with open(filename, 'r') as f:
          return hashlib.sha1(f.read()).hexdigest() == sha1

Any idea how to fix this?

Ehab AlBadawy asked Apr 24 '16 16:04


2 Answers

You are opening a file that is not UTF-8 encoded, while the default encoding for your system is set to UTF-8.

Since you are calculating a SHA1 hash, you should read the data as binary instead. The hashlib functions require that you pass in bytes:

with open(filename, 'rb') as f:
    return hashlib.sha1(f.read()).hexdigest() == sha1

Note the added b in the file mode: 'rb' instead of 'r'.
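For reference, a self-contained version of the check might look like the sketch below. The name file_matches_sha1 and its parameters are illustrative stand-ins for the question's model_checks_out closure, not part of the original script.

```python
import hashlib


def file_matches_sha1(filename, expected_sha1):
    """Return True if the file's SHA1 hex digest equals expected_sha1.

    Hypothetical helper, standing in for the question's model_checks_out.
    """
    # 'rb' returns bytes, so no text decoding is attempted on the file.
    with open(filename, 'rb') as f:
        return hashlib.sha1(f.read()).hexdigest() == expected_sha1
```

Because the file is never decoded, the check behaves the same whatever the system's default encoding is.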

See the open() documentation:

mode is an optional string that specifies the mode in which the file is opened. It defaults to 'r' which means open for reading in text mode. [...] In text mode, if encoding is not specified the encoding used is platform dependent: locale.getpreferredencoding(False) is called to get the current locale encoding. (For reading and writing raw bytes use binary mode and leave encoding unspecified.)

and from the hashlib module documentation:

You can now feed this object with bytes-like objects (normally bytes) using the update() method.
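The update() method mentioned in that quote also allows hashing a large model file piece by piece instead of reading it all at once. This is a minimal sketch; sha1_of_file and the default chunk size are assumptions chosen for illustration.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Hypothetical helper: hash a file without loading it all into memory."""
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        # iter() with a sentinel keeps calling f.read(chunk_size)
        # until it returns b'', which signals end of file.
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```

Feeding the data in chunks gives the same digest as a single call on the concatenated bytes.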

Martijn Pieters answered Sep 29 '22 08:09


You didn't open the file in binary mode, so f.read() tries to decode it as UTF-8 text, and that fails because the file's contents aren't valid UTF-8. But since we take the hash of bytes, not of strings, it doesn't matter what the encoding is, or even whether the file is text at all: just open and read it as a binary file.

>>> with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
Traceback (most recent call last):
  File "<ipython-input-3-fdba09d5390b>", line 1, in <module>
    with open("test.h5.bz2","r") as f: print(hashlib.sha1(f.read()).hexdigest())
  File "/home/dsm/sys/pys/Python-3.5.1-bin/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb8 in position 10: invalid start byte

but

>>> with open("test.h5.bz2","rb") as f: print(hashlib.sha1(f.read()).hexdigest())
21bd89480061c80f347e34594e71c6943ca11325
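The same behaviour can be reproduced without a real model file. The sketch below writes a made-up payload containing byte 0x80 to a temporary file; the payload and variable names are assumptions for illustration only, and the encoding is passed explicitly so the result does not depend on the locale.

```python
import hashlib
import os
import tempfile

# Made-up payload: 0x80 can never start a UTF-8 sequence.
payload = b'BZh9\x80\x00binary data'

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

try:
    try:
        # Text mode: the bytes are decoded as UTF-8 before being returned.
        with open(path, 'r', encoding='utf-8') as f:
            f.read()
        text_mode_failed = False
    except UnicodeDecodeError:
        text_mode_failed = True

    # Binary mode: the raw bytes go straight to hashlib.
    with open(path, 'rb') as f:
        binary_digest = hashlib.sha1(f.read()).hexdigest()
finally:
    os.remove(path)

print(text_mode_failed)
print(binary_digest == hashlib.sha1(payload).hexdigest())
```

Even when a file happens to decode cleanly in text mode, hashlib.sha1() in Python 3 rejects str input with a TypeError, so binary mode is needed either way.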
DSM answered Sep 29 '22 10:09