UTF-8 In Python logging, how?

I'm trying to log a UTF-8 encoded string to a file using Python's logging package. As a toy example:

import logging

def logging_test():
    handler = logging.FileHandler("/home/ted/logfile.txt", "w",
                                  encoding="UTF-8")
    formatter = logging.Formatter("%(message)s")
    handler.setFormatter(formatter)
    root_logger = logging.getLogger()
    root_logger.addHandler(handler)
    root_logger.setLevel(logging.INFO)

    # This is an o with a hat on it.
    byte_string = '\xc3\xb4'
    unicode_string = unicode("\xc3\xb4", "utf-8")

    print "printed unicode object: %s" % unicode_string

    # Explode
    root_logger.info(unicode_string)

if __name__ == "__main__":
    logging_test()

This explodes with UnicodeDecodeError on the root_logger.info() call.

At a lower level, Python's logging package is using the codecs package to open the log file, passing in the "UTF-8" argument as the encoding. That's all well and good, but it's trying to write byte strings to the file instead of unicode objects, which explodes. Essentially, Python is doing this:

file_handler.write(unicode_string.encode("UTF-8")) 

When it should be doing this:

file_handler.write(unicode_string) 
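
As a minimal sketch of that second form, assuming a stream opened through codecs.open as described above: the unicode object is handed to write() as-is and the stream does the encoding. The temporary path and the variable names are made up for illustration, and the snippet is written to run on Python 2.6+ as well as Python 3.

```python
import codecs
import os
import tempfile

# Hypothetical scratch location, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "logfile.txt")

text = u"\u00f4"  # an o with a hat on it, as a unicode object

# The stream was opened with an encoding, so it accepts unicode directly
# and encodes on the way out. No manual .encode("UTF-8") is needed.
with codecs.open(path, "w", encoding="UTF-8") as stream:
    stream.write(text)

# Read the raw bytes back to see what actually landed on disk.
with open(path, "rb") as raw_file:
    raw = raw_file.read()
```

Here raw holds the two UTF-8 bytes \xc3\xb4, the same byte string used in the toy example.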

Is this a bug in Python, or am I taking crazy pills? FWIW, this is a stock Python 2.6 installation.

asked Oct 09 '09 18:10 by Ted Dziuba



1 Answer

Code like this:

raise Exception(u'щ') 

caused this error in the logging module:

  File "/usr/lib/python2.7/logging/__init__.py", line 467, in format
    s = self._fmt % record.__dict__
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-3: ordinal not in range(128)

This happens because the format string is a byte string, while some of the format string arguments are unicode strings with non-ASCII characters:

>>> "%(message)s" % {'message': Exception(u'\u0449')}
*** UnicodeEncodeError: 'ascii' codec can't encode character u'\u0449' in position 0: ordinal not in range(128)
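
As far as I can tell, the step that fails is an implicit ASCII encode of the unicode message when it is forced into a byte string. The sketch below performs that encode explicitly so the failure can be seen on its own; it is an illustration of the mechanism under that assumption, not the logging module's actual code path, and the variable names are made up. It runs on Python 2 and Python 3.

```python
text = u"\u0449"  # the character from the example above

try:
    # Forcing a non-ASCII unicode string through the ASCII codec fails.
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    failure_reason = exc.reason  # the tail of the error message

# An encoding that can represent the character succeeds.
utf8_bytes = text.encode("utf-8")
```

The reason reported by the exception matches the end of the error message above, while the UTF-8 encode of the same string goes through without complaint.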

Making the format string unicode fixes the issue:

>>> u"%(message)s" % {'message': Exception(u'\u0449')}
u'\u0449'

So, in your logging configuration, make all format strings unicode:

'formatters': {
    'simple': {
        'format': u'%(asctime)-s %(levelname)s [%(name)s]: %(message)s',
        'datefmt': '%Y-%m-%d %H:%M:%S',
    },
...

And patch the default logging formatter to use a unicode format string:

logging._defaultFormatter = logging.Formatter(u"%(message)s") 
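
For reference, here is a sketch of how a complete configuration along these lines might look, assuming logging.config.dictConfig (Python 2.7+) and a file handler opened with a UTF-8 encoding. The logger name "demo", the handler name "file" and the temporary log path are hypothetical, and %(asctime)s is left out of the format only so the resulting line is predictable. On Python 3 the u prefix is redundant but harmless.

```python
import io
import logging
import logging.config
import os
import tempfile

# Hypothetical scratch location for the log file.
log_path = os.path.join(tempfile.mkdtemp(), "app.log")

LOGGING = {
    "version": 1,
    "disable_existing_loggers": False,
    "formatters": {
        # Unicode format string, as recommended above.
        "simple": {"format": u"%(levelname)s [%(name)s]: %(message)s"},
    },
    "handlers": {
        "file": {
            "class": "logging.FileHandler",
            "filename": log_path,
            "mode": "w",
            "encoding": "utf-8",  # the handler encodes on write
            "formatter": "simple",
        },
    },
    "loggers": {
        "demo": {"handlers": ["file"], "level": "INFO", "propagate": False},
    },
}

logging.config.dictConfig(LOGGING)
logger = logging.getLogger("demo")
logger.info(u"\u0449")  # non-ASCII message passed as unicode

logging.shutdown()  # flush and close the handler before reading the file

with io.open(log_path, encoding="utf-8") as log_file:
    content = log_file.read()
```

After this runs, content is the single line "INFO [demo]: щ" followed by a newline.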

answered Sep 18 '22 18:09 by warvariuc