I'm struggling with print and unicode conversion. Here is some code executed in the Python 2.5 interpreter on Windows.
>>> import sys
>>> print sys.stdout.encoding
cp850
>>> print u"é"
é
>>> print u"é".encode("cp850")
é
>>> print u"é".encode("utf8")
├®
>>> print u"é".__repr__()
u'\xe9'
>>> class A():
...     def __unicode__(self):
...         return u"é"
...
>>> print A()
<__main__.A instance at 0x0000000002AEEA88>
>>> class B():
...     def __repr__(self):
...         return u"é".encode("cp850")
...
>>> print B()
é
>>> class C():
...     def __repr__(self):
...         return u"é".encode("utf8")
...
>>> print C()
├®
>>> class D():
...     def __str__(self):
...         return u"é"
...
>>> print D()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
>>> class E():
...     def __repr__(self):
...         return u"é"
...
>>> print E()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xe9' in position 0: ordinal not in range(128)
So, when a unicode string is printed, it is not its __repr__() method that is called and printed.
But when an object is printed, __str__() (or __repr__(), if __str__ is not implemented) is called, not __unicode__(). Neither can return a unicode string.
But why? If __repr__() or __str__() returns a unicode string, shouldn't the behavior be the same as when we print a unicode string? In other words: why is print D() different from print D().__str__()?
Am I missing something?
These samples also show that if you want to print an object represented with unicode strings, you have to encode it to a byte string (type str). But for nice printing (to avoid the "├®"), the result depends on the sys.stdout encoding.
So, do I have to add u"é".encode(sys.stdout.encoding) to each of my __str__ or __repr__ methods? Or return repr(u"é")?
What if I use piping? Is it the same encoding as sys.stdout?
My main issue is to make a class "printable", i.e. print A() prints something fully readable (not with the \x*** unicode characters).
Here is the bad behavior/code that needs to be modified:
class User(object):
    name = u"Luiz Inácio Lula da Silva"
    def __repr__(self):
        # attempt 1: returns unicode
        return "<User: %s>" % self.name
        # attempt 2: won't display gracefully
        # e.g. print repr(u'é') -> u'\xe9'
        return repr("<User: %s>" % self.name)
        # attempt 3: won't display gracefully
        # e.g. print u"é".encode("utf8") -> print '\xc3\xa9' -> ├®
        return ("<User: %s>" % self.name).encode("utf8")
Thanks!
Python doesn't have many semantic type constraints on given functions and methods, but it has a few, and here's one of them: __str__ (in Python 2.*) must return a byte string. As usual, if a unicode object is found where a byte string is required, the current default encoding (usually 'ascii') is applied in the attempt to make the required byte string from the unicode object in question.
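
That implicit conversion can be reproduced explicitly. Here is a minimal sketch, assuming the default encoding is 'ascii'; the helper name to_byte_string is made up for illustration, and the snippet is written so that it runs on both Python 2 and Python 3:

```python
def to_byte_string(text, encoding="ascii"):
    """Encode text explicitly, mimicking the implicit conversion Python 2
    applies when a unicode object turns up where a byte string is required.

    Returns the encoded bytes, or None if the text cannot be encoded.
    """
    try:
        return text.encode(encoding)
    except UnicodeEncodeError:
        # This is the error seen in the question's tracebacks.
        return None


plain = to_byte_string(u"Pele")        # pure ASCII, so encoding succeeds
accented = to_byte_string(u"Pel\xe9")  # u'\xe9' is outside ASCII, so it fails
```

The second call fails for the same reason print D() does in the question: the character u'\xe9' has no ASCII representation.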
For this operation, the encoding (if any) of any given file object is irrelevant, because what's being returned from __str__ may be about to be printed, or may be going to be subject to completely different and unrelated treatment. Your purpose in calling __str__ does not matter to the call itself and its results; Python, in general, doesn't take into account the "future context" of an operation (what you are going to do with the result after the operation is done) in determining the operation's semantics.
That's because Python doesn't always know your future intentions, and it tries to minimize the amount of surprise. print str(x) and s = str(x); print s (the same operations performed in one gulp vs two), in particular, must have the same effects; in the second case, there will be an exception if str(x) cannot validly produce a byte string (that is, for example, x.__str__() can't), and therefore the exception should also occur in the first case.
print itself (since 2.4, I believe), when presented with a unicode object, takes into consideration the .encoding attribute (if any) of the target stream (by default sys.stdout); other operations, as yet unconnected to any given target stream, don't, and str(x) (i.e. x.__str__()) is just such an operation.
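
To make the role of the stream encoding concrete, here is a rough sketch of the encoding step (not print's actual implementation). The function name encode_for_stream is hypothetical, and 'cp850' is taken from the question's sys.stdout.encoding; the snippet runs on both Python 2 and Python 3:

```python
def encode_for_stream(text, stream_encoding):
    """Encode text with the target stream's encoding, falling back to
    ASCII when the stream does not advertise one."""
    return text.encode(stream_encoding or "ascii")


# What a cp850 console needs in order to display an e-acute:
console_bytes = encode_for_stream(u"\xe9", "cp850")

# What the question's classes C and User produce instead: UTF-8 bytes.
utf8_bytes = u"\xe9".encode("utf8")

# A cp850 console interprets those two bytes as two separate characters,
# which is the garbled pair shown in the question.
garbled = utf8_bytes.decode("cp850")
```

This is why encoding to UTF-8 inside __repr__ looks wrong on that console: the bytes are valid, but the console decodes them with a different code page.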
Hope this helped show the reason for the behavior that is annoying you...
Edit: the OP now clarifies "My main issue is to make a class "printable", i.e. print A() prints something fully readable (not with the \x*** unicode characters).". Here's the approach I think works best for that specific goal:
import sys
DEFAULT_ENCODING = 'UTF-8' # or whatever you like best
class sic(object):
    def __unicode__(self): # the "real thing"
        return u'Pel\xe9'
    def __str__(self): # tries to "look nice"
        return unicode(self).encode(sys.stdout.encoding or DEFAULT_ENCODING,
                                    'replace')
    def __repr__(self): # must be unambiguous
        return repr(unicode(self))
That is, this approach focuses on __unicode__ as the primary way for the class's instances to format themselves, but since (in Python 2) print calls __str__ instead, it has that one delegate to __unicode__ with the best it can do in terms of encoding. Not perfect, but then Python 2's print statement is far from perfect anyway;-).
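
On the piping question: in Python 2, sys.stdout.encoding is typically None when output is redirected to a pipe or file, which is what the "or DEFAULT_ENCODING" fallback is for. The sketch below isolates that choice; output_encoding, PipedStream and ConsoleStream are hypothetical names standing in for real streams, and the code runs on both Python 2 and Python 3:

```python
import sys

DEFAULT_ENCODING = "UTF-8"  # arbitrary fallback, pick whatever suits you


def output_encoding(stream=None, default=DEFAULT_ENCODING):
    """Return the stream's encoding, or the default if it has none."""
    if stream is None:
        stream = sys.stdout
    return getattr(stream, "encoding", None) or default


class PipedStream(object):
    """Stand-in for a redirected stdout that reports no encoding."""
    encoding = None


class ConsoleStream(object):
    """Stand-in for the cp850 Windows console from the question."""
    encoding = "cp850"


text = u"Pel\xe9"
piped = text.encode(output_encoding(PipedStream()), "replace")
console = text.encode(output_encoding(ConsoleStream()), "replace")

# The 'replace' error handler substitutes '?' instead of raising
# when a character cannot be represented in the target encoding.
ascii_only = text.encode("ascii", "replace")
```

The 'replace' argument trades fidelity for robustness: __str__ never raises, at the cost of showing '?' on streams that cannot represent the character.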
__repr__, for its part, must strive to be unambiguous, that is, not to "look nice" at the expense of risking ambiguity (ideally, when feasible, it should return a byte string that, if passed to eval, would make an instance equal to the present one... that's far from always feasible, but the lack of ambiguity is the absolute core of the distinction between __str__ and __repr__, and I strongly recommend respecting that distinction!).
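
As an illustration of the eval round trip, here is a small sketch using a made-up Point class (not part of the code above); it runs on both Python 2 and Python 3:

```python
class Point(object):
    """Hypothetical class whose repr can be passed back to eval."""

    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __eq__(self, other):
        return isinstance(other, Point) and (self.x, self.y) == (other.x, other.y)

    def __ne__(self, other):
        # Python 2 does not derive != from ==, so define it explicitly.
        return not self == other

    def __repr__(self):
        # Unambiguous: names the type and uses %r for each field.
        return "Point(%r, %r)" % (self.x, self.y)

    def __str__(self):
        # Free to "look nice" for human readers.
        return "(%s, %s)" % (self.x, self.y)


p = Point(1, 2)
round_tripped = eval(repr(p))  # rebuilds an equal instance from the repr
```

Here str(p) is the friendly form, while repr(p) carries enough information to reconstruct the object.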