
How to print tuples of unicode strings in original language (not u'foo' form)

Tags: python, unicode

I have a list of tuples of unicode objects:

>>> t = [('亀',), ('犬',)]

Printing this out, I get:

>>> print t
[('\xe4\xba\x80',), ('\xe7\x8a\xac',)]

which I guess is a list of the UTF-8 byte representations of those strings?

but what I want to see printed out is, surprise:

[('亀',), ('犬',)]

but I'm having an inordinate amount of trouble getting the bytes back into a human-readable form.

Asked by Daniel H on Mar 07 '09 at 04:03


2 Answers

but what I want to see printed out is, surprise:

[('亀',), ('犬',)]

What do you want to see it printed out on? Because if it's the console, it's not at all guaranteed your console can display those characters. This is why Python's ‘repr()’ representation of objects goes for the safe option of \-escapes, which you will always be able to see on-screen and type in easily.

As a prerequisite, you should be using Unicode strings (u''). And, as mentioned by Matthew, if you want to write u'亀' directly in source, you need to make sure Python can read the file's encoding. For occasional non-ASCII characters it is best to stick with the escaped version u'\u4e80', but when you have a lot of East Asian text that you want to be able to read, “# coding=utf-8” is definitely the way to go. With that in place, you could join the pieces by hand:

print '[%s]' % ', '.join('(%s,)' % ', '.join(ti) for ti in t)

That prints the characters without the surrounding quotes. To keep the quotes, as repr() would, you'd really want:

def reprunicode(u):
    # repr() gives an ASCII-only str; decoding turns its \uXXXX escapes back into characters.
    return repr(u).decode('raw_unicode_escape')

print u'[%s]' % u', '.join([u'(%s,)' % reprunicode(ti[0]) for ti in t])

This works, but if the console doesn't support Unicode (and this is especially troublesome on Windows), you'll get a big old UnicodeError.
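One way to sidestep that error is to escape only the characters the output encoding cannot represent. This is a minimal sketch, assuming Python 3. `safe_text` is a hypothetical helper, not a standard-library function, and the fallback to ASCII when the stream reports no encoding is also an assumption:

```python
import sys


def safe_text(text, encoding=None):
    """Return text with characters the target encoding lacks shown as backslash escapes."""
    # Assumption: use the stream's declared encoding, falling back to ASCII.
    encoding = encoding or getattr(sys.stdout, "encoding", None) or "ascii"
    # 'backslashreplace' substitutes \uXXXX escapes instead of raising UnicodeEncodeError.
    return text.encode(encoding, errors="backslashreplace").decode(encoding)


# Forcing ASCII shows the fallback: every non-ASCII character becomes an escape.
escaped = safe_text("[('亀',), ('犬',)]", "ascii")
print(escaped)
```

On a console that can display the characters, the text passes through unchanged, so the escapes only appear where they are needed.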

In any case, this rarely matters because the repr() of an object, which is what you're seeing here, doesn't usually make it to the public user interface of an application; it's really for the coder only.

However, you'll be pleased to know that Python 3.0 behaves exactly as you want:

  • plain '' strings without the ‘u’ prefix are now Unicode strings
  • repr() shows most Unicode characters verbatim
  • Unicode in the Windows console is better supported (you can still get UnicodeError on Unix if your environment isn't UTF-8)

Python 3.0 is a little bit new and not so well-supported by libraries, but it might well suit your needs better.
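The difference can be checked with a short sketch, assuming a Python 3 interpreter. The built-in ascii() is used here only to show the escaped style for comparison:

```python
t = [('亀',), ('犬',)]

# In Python 3, repr() keeps printable non-ASCII characters as they are.
readable = repr(t)

# ascii() escapes them instead, much like Python 2's repr() did for unicode objects.
escaped = ascii(t)

print(escaped)
```

Here `readable` holds the text the question asks for, while `escaped` contains only ASCII and is safe to print on any console.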

Answered by bobince on Oct 23 '22 at 01:10


First, there's a slight misunderstanding in your post. If you define a list like this:

>>> t = [('亀',), ('犬',)]

...those are not unicode objects you define, but str objects (byte strings). If you want unicode objects, you have to add a u before the opening quote:

>>> t = [(u'亀',), (u'犬',)]

But let's assume you actually want strs, not unicodes. The main problem is that the __str__ method of a list (or a tuple) is practically identical to its __repr__ method, which returns a string that, when evaluated, would recreate the same object. It therefore calls repr(), not str(), on each element. Because __repr__ should be encoding-independent, strings are represented in the safest form possible: each byte outside the printable ASCII range is written as a hex escape (\xe4, for example).
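The escaping can be observed directly. This is a small sketch, assuming Python 3, where the bytes type plays the role of the Python 2 str discussed here:

```python
# The three UTF-8 bytes behind the character.
raw = '亀'.encode('utf-8')

# str() of a list falls back to repr() for the list and for each element.
items = [raw]
assert str(items) == repr(items)

# repr() writes every byte outside printable ASCII as a \xNN escape.
print(repr(raw))

# Decoding with the right codec recovers the original character.
assert raw.decode('utf-8') == '亀'
```

The printed escapes are the same \xe4\xba\x80 sequence that appears in the question's output.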

Unfortunately, as far as I know, there's no library method for printing a list that is locale-aware. You could use an almost-general-purpose function like this:

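def collection_str(collection):
    # Lists and tuples are formatted recursively; other values fall through to str().
    if isinstance(collection, list):
        brackets = '[%s]'
        single_add = ''
    elif isinstance(collection, tuple):
        brackets = '(%s)'
        single_add = ','  # a one-element tuple needs its trailing comma
    elif isinstance(collection, basestring):
        # Re-add the quotes by hand; quotes inside the string are not escaped.
        return "'%s'" % collection
    else:
        return str(collection)
    items = ', '.join([collection_str(x) for x in collection])
    if len(collection) == 1:
        items += single_add
    return brackets % items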

>>> print collection_str(t)
[('亀',), ('犬',)]

Note that this won't work for all possible collections (sets and dictionaries, for example), but it's easy to extend it to handle those.
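One possible extension is sketched below. It assumes Python 3 (where every str is Unicode), so it is an adaptation, not the original Python 2 function. The dict and set branches and the naive quoting of strings are illustrative choices:

```python
def collection_str(collection):
    """Format nested containers without escaping non-ASCII text."""
    if isinstance(collection, str):
        # Naive quoting: quotes inside the string are not escaped.
        return "'%s'" % collection
    if isinstance(collection, dict):
        pairs = ['%s: %s' % (collection_str(k), collection_str(v))
                 for k, v in collection.items()]
        return '{%s}' % ', '.join(pairs)
    if isinstance(collection, (set, frozenset)):
        if not collection:
            return 'set()'  # '{}' would read as an empty dict
        return '{%s}' % ', '.join(collection_str(x) for x in collection)
    if isinstance(collection, list):
        return '[%s]' % ', '.join(collection_str(x) for x in collection)
    if isinstance(collection, tuple):
        items = ', '.join(collection_str(x) for x in collection)
        # A one-element tuple needs its trailing comma.
        return '(%s,)' % items if len(collection) == 1 else '(%s)' % items
    return str(collection)


result = collection_str({'名': [('亀',), ('犬',)]})
```

Set elements come out in iteration order, which is arbitrary, so the output for a multi-element set can differ between runs.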

Answered by DzinX on Oct 23 '22 at 01:10