I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example:
import logging

def logging_test():
    handler = logging.FileHandler("/home/ted/logfile.txt", "w", encoding = "UTF-8")
    formatter = logging.Formatter("%(message)s")
    handler.setFormatter(formatter)
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # This is an o with a hat on it.
    byte_string = '\xc3\xb4'
    unicode_string = unicode("\xc3\xb4", "utf-8")

    print "printed unicode object: %s" % unicode_string

    # Explode
    root_logger.info(unicode_string)

if __name__ == "__main__":
    logging_test()
This explodes with UnicodeDecodeError on the logging.info() call.
At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this:
file_handler.write(unicode_string.encode("UTF-8"))
When it should be doing this:
file_handler.write(unicode_string)
Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation.
UTF-8 is a byte oriented encoding. The encoding specifies that each character is represented by a specific sequence of one or more bytes.
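To make that concrete, here is a minimal sketch. It uses `io.open` as a stand-in for the codecs-based file that logging opens, and the temporary file name is made up for the example. It shows that the "o with a hat" becomes two bytes when encoded, and that a file opened with an encoding expects the unicode text and does the encoding itself.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u"\u00f4"                # LATIN SMALL LETTER O WITH CIRCUMFLEX
encoded = text.encode("utf-8")  # UTF-8 represents it as two bytes

# Hypothetical scratch file, only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "utf8_demo.txt")

# A file opened with an encoding takes unicode text and encodes it on write.
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading the file back in binary mode shows the bytes that were stored.
with io.open(path, "rb") as f:
    raw = f.read()
```

The bytes on disk (`raw`) are the same as `encoded`, so encoding by hand before writing to such a file would be one encode too many.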
Code like this:
raise Exception(u'щ')
caused the following error:
  File "/usr/lib/python2.7/logging/__init__.py", line 467, in format
    s = self._fmt % record.__dict__
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)
This happens because the format string is a byte string, while some of the format string arguments are unicode strings with non-ASCII characters:
>>> "%(message)s" % {'message': Exception(u'\u0449')}
*** UnicodeEncodeError: 'ascii' codec can't encode character u'\u0449' in position 0: ordinal not in range(128)
Making the format string unicode fixes the issue:
>>> u"%(message)s" % {'message': Exception(u'\u0449')}
u'\u0449'
So, in your logging configuration, make all format strings unicode:
'formatters': {
    'simple': {
        'format': u'%(asctime)-s %(levelname)s [%(name)s]: %(message)s',
        'datefmt': '%Y-%m-%d %H:%M:%S',
    },
    ...
And patch the default logging formatter to use a unicode format string:
logging._defaultFormatter = logging.Formatter(u"%(message)s")
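For reference, here is a self-contained sketch of a configuration along these lines, assuming `logging.config.dictConfig` is available (Python 2.7 or later). The logger name `unicode_demo` and the temporary log path are made up for the example. It uses a unicode format string together with a UTF-8 `FileHandler`, then reads the file back to check what was written.

```python
# -*- coding: utf-8 -*-
import io
import logging
import logging.config
import os
import tempfile

# Hypothetical log location, only for this demonstration.
log_path = os.path.join(tempfile.mkdtemp(), "unicode_demo.log")

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        "simple": {
            # Unicode format string, as suggested above.
            "format": u"%(levelname)s [%(name)s]: %(message)s",
        },
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "filename": log_path,
            "mode": "w",
            "encoding": "utf-8",
            "formatter": "simple",
        },
    },
    "loggers": {
        # "unicode_demo" is an illustrative logger name.
        "unicode_demo": {
            "handlers": ["file"],
            "level": "INFO",
            "propagate": False,
        },
    },
}

logging.config.dictConfig(LOGGING)
logger = logging.getLogger("unicode_demo")
logger.info(u"\u0449")  # non-ASCII message

# Close the handler so the record is flushed to disk before reading it back.
for handler in logger.handlers:
    handler.close()

with io.open(log_path, encoding="utf-8") as f:
    logged = f.read()
```

The `encoding` key is passed through to `FileHandler`, so the handler does the encoding and the message can stay a unicode string all the way down.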