'utf-8' decode error in tensorflow tutorial

I'm running into a bizarre problem. When I run

  from tensorflow.examples.tutorials.mnist import input_data

  mnist = input_data.read_data_sets('/home/fqiao/development/MNIST_data/', one_hot=True)

I get:

  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/examples/tutorials/mnist/input_data.py", line 199, in read_data_sets
    train_images = extract_images(local_file)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/examples/tutorials/mnist/input_data.py", line 58, in extract_images
    magic = _read32(bytestream)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/examples/tutorials/mnist/input_data.py", line 51, in _read32
    return numpy.frombuffer(bytestream.read(4), dtype=dt)[0]
  File "/usr/lib/python3.5/gzip.py", line 274, in read
    return self._buffer.read(size)
  File "/usr/lib/python3.5/_compression.py", line 68, in readinto
    data = self.read(len(byte_view))
  File "/usr/lib/python3.5/gzip.py", line 461, in read
    if not self._read_gzip_header():
  File "/usr/lib/python3.5/gzip.py", line 404, in _read_gzip_header
    magic = self._fp.read(2)
  File "/usr/lib/python3.5/gzip.py", line 91, in read
    self.file.read(size-self._length+read)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/default/_gfile.py", line 45, in sync
    return fn(self, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/default/_gfile.py", line 199, in read
    return self._fp.read(n)
  File "/usr/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

However, if I just run the code in input_data.py directly, everything appears to be fine:

>>> import gzip
>>> import numpy
>>> import tensorflow as tf
>>> dt = numpy.dtype(numpy.uint32).newbyteorder('>')
>>> f = tf.gfile.Open('/home/fqiao/development/MNIST_data/train-images-idx3-ubyte.gz', 'rb')
>>> bytestream = gzip.GzipFile(fileobj=f)
>>> testbytes = numpy.frombuffer(bytestream.read(4), dtype=dt)[0]
>>> testbytes
2051

Does anyone have any idea what's going on?

My system: Ubuntu 15.10 x64, Python 3.5.0.

Cescante asked Feb 19 '16 19:02


2 Answers

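The bug has been fixed in a recent change, 555e73d. The MNIST files need to be opened in binary mode ('rb') instead of text mode ('r'). In text mode, Python 3 tries to decode the file contents as UTF-8, and the second byte of the gzip header (0x8b) is not a valid UTF-8 start byte, which is exactly the error in the traceback.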
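To see the difference between the two modes without TensorFlow, here is a minimal sketch using only the standard library. The helper names and the temporary demo file are made up for illustration, and `struct.unpack` stands in for the `numpy.frombuffer` call in input_data.py; the file just holds the magic number 2051 from the question.

```python
import gzip
import os
import struct
import tempfile


def read_magic_binary(path):
    """Open the .gz file in binary mode and read a big-endian uint32."""
    with open(path, "rb") as f:
        with gzip.GzipFile(fileobj=f) as bytestream:
            return struct.unpack(">I", bytestream.read(4))[0]


def read_text_mode_error(path):
    """Open the same file in text mode; return the decode error message, if any."""
    try:
        with open(path, "r", encoding="utf-8") as f:
            f.read(2)
    except UnicodeDecodeError as exc:
        return str(exc)
    return None


# Hypothetical stand-in for an MNIST file: a gzip file holding one uint32.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo-idx3-ubyte.gz")
with gzip.open(demo_path, "wb") as f:
    f.write(struct.pack(">I", 2051))

magic = read_magic_binary(demo_path)            # binary mode works
error_message = read_text_mode_error(demo_path)  # text mode fails on byte 0x8b
print(magic)
print(error_message)
```

Binary mode hands the raw bytes to gzip untouched, while text mode runs them through the UTF-8 decoder first and fails on the gzip header.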

Cescante answered Nov 04 '22 15:11


In my case, the problem was the encoding of the data file.

Open the file in vim, run the command below, and save the file (:w) so that it is written out in the new encoding:

:set fileencoding=utf-8

That solved the issue in my case.
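The same re-encoding can be sketched in Python for a plain-text data file. This is only an illustration: the function name, the demo file, and the latin-1 source encoding are assumptions, and the approach does not apply to binary files such as the gzipped MNIST archives.

```python
import os
import tempfile


def reencode_to_utf8(path, source_encoding):
    """Rewrite a text file in place as UTF-8 (source_encoding must be known)."""
    # newline="" keeps the original line endings untouched.
    with open(path, "r", encoding=source_encoding, newline="") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8", newline="") as f:
        f.write(text)


# Hypothetical demo file written in latin-1.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, "demo.txt")
with open(demo_path, "w", encoding="latin-1", newline="") as f:
    f.write("café\n")

reencode_to_utf8(demo_path, "latin-1")

with open(demo_path, "rb") as f:
    raw = f.read()
print(raw)
```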

wael34218 answered Nov 04 '22 14:11