I have created a package that uses UTF-8 encoding.
One of its functions returns a DataFrame with a column encoded in UTF-8.
When using IPython at the command line, I have no problem displaying the contents of this table. In the Notebook, however, it crashes with the error 'utf8' codec can't decode byte 0xe7. I've attached the full traceback below.
What is the proper encoding to use with the Notebook?
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-13-92c0011919e7> in <module>()
      3 ver = verif.VerificacaoNA()
      4 comp, total = ver.executarCompRealFisica(DT_INI, DT_FIN)
----> 5 comp

c:\Python27-32\lib\site-packages\ipython-0.13.1-py2.7.egg\IPython\core\displayhook.pyc in __call__(self, result)
    240                 self.update_user_ns(result)
    241                 self.log_output(format_dict)
--> 242                 self.finish_displayhook()
    243
    244     def flush(self):

c:\Python27-32\lib\site-packages\ipython-0.13.1-py2.7.egg\IPython\zmq\displayhook.pyc in finish_displayhook(self)
     59         sys.stdout.flush()
     60         sys.stderr.flush()
---> 61         self.session.send(self.pub_socket, self.msg, ident=self.topic)
     62         self.msg = None
     63

c:\Python27-32\lib\site-packages\ipython-0.13.1-py2.7.egg\IPython\zmq\session.pyc in send(self, stream, msg_or_type, content, parent, ident, buffers, subheader, track, header)
    557
    558         buffers = [] if buffers is None else buffers
--> 559         to_send = self.serialize(msg, ident)
    560         flag = 0
    561         if buffers:

c:\Python27-32\lib\site-packages\ipython-0.13.1-py2.7.egg\IPython\zmq\session.pyc in serialize(self, msg, ident)
    461             content = self.none
    462         elif isinstance(content, dict):
--> 463             content = self.pack(content)
    464         elif isinstance(content, bytes):
    465             # content is already packed, as in a relayed message

c:\Python27-32\lib\site-packages\ipython-0.13.1-py2.7.egg\IPython\zmq\session.pyc in <lambda>(obj)
     76
     77 # ISO8601-ify datetime objects
---> 78 json_packer = lambda obj: jsonapi.dumps(obj, default=date_default)
     79 json_unpacker = lambda s: extract_dates(jsonapi.loads(s))
     80

c:\Python27-32\lib\site-packages\pyzmq-13.0.0-py2.7-win32.egg\zmq\utils\jsonapi.pyc in dumps(o, **kwargs)
     70         kwargs['separators'] = (',', ':')
     71
---> 72     return _squash_unicode(jsonmod.dumps(o, **kwargs))
     73
     74 def loads(s, **kwargs):

c:\Python27-32\lib\json\__init__.pyc in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, encoding, default, **kw)
    236         check_circular=check_circular, allow_nan=allow_nan, indent=indent,
    237         separators=separators, encoding=encoding, default=default,
--> 238         **kw).encode(obj)
    239
    240

c:\Python27-32\lib\json\encoder.pyc in encode(self, o)
    199         # exceptions aren't as detailed. The list call should be roughly
    200         # equivalent to the PySequence_Fast that ''.join() would do.
--> 201         chunks = self.iterencode(o, _one_shot=True)
    202         if not isinstance(chunks, (list, tuple)):
    203             chunks = list(chunks)

c:\Python27-32\lib\json\encoder.pyc in iterencode(self, o, _one_shot)
    262                 self.key_separator, self.item_separator, self.sort_keys,
    263                 self.skipkeys, _one_shot)
--> 264         return _iterencode(o, 0)
    265
    266 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,

UnicodeDecodeError: 'utf8' codec can't decode byte 0xe7 in position 199: invalid continuation byte
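The byte 0xe7 in the last line is a useful hint. In Latin-1 and Windows-1252 it is the single-byte encoding of "ç", whereas UTF-8 encodes "ç" as the two bytes 0xc3 0xa7. The following is a minimal sketch of how that mismatch produces this exact error. It assumes the column really holds cp1252 bytes (the traceback does not say where the data came from), the sample word is made up, and the code is written for Python 3 rather than the Python 2.7 shown in the traceback.

```python
# Hypothetical sample: the Portuguese word "preço" stored as cp1252 bytes.
raw = "pre\u00e7o".encode("cp1252")  # b'pre\xe7o'

try:
    raw.decode("utf-8")
    error_message = None
except UnicodeDecodeError as exc:
    # 0xe7 opens a multi-byte UTF-8 sequence, but the next byte ("o")
    # is not a continuation byte, so decoding fails.
    error_message = str(exc)

# Decoding with the encoding the bytes were actually written in works.
text = raw.decode("cp1252")
```

If decoding with "cp1252" or "latin-1" gives readable text, the column was probably never UTF-8 in the first place.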
UTF-8 is one of the most commonly used encodings, and Python often defaults to using it. UTF stands for “Unicode Transformation Format”, and the '8' means that 8-bit values are used in the encoding. (There are also UTF-16 and UTF-32 encodings, but they are less frequently used than UTF-8.)
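A short illustration of what "8-bit values" means in practice: UTF-8 uses between one and four bytes per character, while UTF-16 uses 16-bit units. This is only a sketch with arbitrarily chosen sample characters, written for Python 3.

```python
# Sample characters chosen for illustration: 'a', 'ç' and '€'.
samples = ["a", "\u00e7", "\u20ac"]

# Number of bytes each character occupies in UTF-8 (variable width).
utf8_lengths = [len(s.encode("utf-8")) for s in samples]

# The same characters in UTF-16 (little-endian, no byte order mark).
utf16_lengths = [len(s.encode("utf-16-le")) for s in samples]

print(utf8_lengths)   # [1, 2, 3]
print(utf16_lengths)  # [2, 2, 2]
```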
The way I read the spec, UTF-8 is not the default encoding in an XML declaration. It is only the default encoding "for an entity which begins with neither a Byte Order Mark nor an encoding declaration".
Jupyter (né IPython) notebook files are simple JSON documents, containing text, source code, rich media output, and metadata. Each segment of the document is stored in a cell.
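Because everything passes through JSON, all strings have to be representable as text, which is why undecodable bytes surface as an error at serialization time. Below is a rough sketch of a stripped-down notebook-like document being round-tripped with the standard json module. The field names follow my understanding of the nbformat 4 layout and the cell contents are invented, so treat the structure as illustrative rather than a valid notebook.

```python
import json

# Hypothetical, minimal notebook-like structure (not validated against nbformat).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["Pre\u00e7o m\u00e9dio"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["print('ok')"]},
    ],
}

# By default json.dumps escapes non-ASCII characters as \uXXXX sequences.
serialized = json.dumps(notebook)

# Loading the text back gives an equal structure.
restored = json.loads(serialized)
cell_types = [cell["cell_type"] for cell in restored["cells"]]
```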
As a content author or developer, you should nowadays always choose the UTF-8 character encoding for your content or data. This Unicode encoding is a good choice because you can use a single character encoding to handle any character you are likely to need. This greatly simplifies things.
I had the same problem recently, and indeed setting the default encoding to UTF-8 did the trick:
import sys
reload(sys)  # Python 2 only: restores sys.setdefaultencoding, which site.py deletes at startup
sys.setdefaultencoding("utf-8")
Running sys.getdefaultencoding() yielded 'ascii' in my environment (Python 2.7.3), so I guess that's the default.
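Changing the process-wide default affects every implicit conversion, and sys.setdefaultencoding does not exist in Python 3. A less invasive option, offered here only as a sketch, is to decode the offending values explicitly before displaying them. The helper name to_text, the fallback to cp1252, and the sample data are all assumptions for illustration and not part of the original package.

```python
def to_text(value, encodings=("utf-8", "cp1252")):
    """Return value as text, trying each candidate encoding in order."""
    if not isinstance(value, bytes):
        return value  # already text, or not a string at all
    for encoding in encodings:
        try:
            return value.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going and mark undecodable bytes with U+FFFD.
    return value.decode(encodings[0], errors="replace")


# Hypothetical column mixing cp1252 bytes, UTF-8 bytes and plain text.
column = [b"pre\xe7o", b"caf\xc3\xa9", "total"]
cleaned = [to_text(value) for value in column]
```

With a pandas DataFrame, the same helper could presumably be applied to a single column with Series.map, for example df["name"].map(to_text), where "name" stands in for the real column.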
Also see this related question and Ian Bicking's blog post on the subject.