Python encoding - Is there any explanation?

Can someone explain to me why Python behaves this way?

Let me explain.

BACKGROUND

I have a Python installation and I want to use some characters that aren't in the ASCII table, so I changed Python's default encoding. I save every string in a .py file like this: '_MAIL_TITLE_': u'Бронирование номеров',

Now, with a method that replaces my dictionary keys, I want to insert my strings into an HTML template dynamically.

I put this in the HTML page's header:

<head>
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
 ...... <!-- Some Css's --> 
</head>

Unfortunately, after those replacements my HTML document comes out with some wrong characters (unconverted? misconverted?).

So I opened a terminal to try to sort things out:

 1 - Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
 2 - [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
 3 - Type "help", "copyright", "credits" or "license" for more information.
 4 - >>> import sys
 5 - >>> sys.getdefaultencoding()
 6 - 'utf-8'
 7 - >>> u'èéòç'
 8 - u'\xe8\xe9\xf2\xe7'
 9 - >>> u'èéòç'.encode('utf-8')
10 - '\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
11 - >>> u'è'
12 - u'\xe8'
13 - >>> u'è'.encode()
14 - '\xc3\xa8'

QUESTION

Take a look at lines [7-10]. Isn't that weird? If my Python has a default encoding of utf-8 (line 6), why does it convert the string on line 7 differently than line 9 does? Now take a look at lines [11-14] and their output.

Now I'm totally confused!

THE HINT

So I tried changing my terminal's input encoding (previously ISO-8859-1, now UTF-8), and something changed:

 1 - Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
 2 - [GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2
 3 - Type "help", "copyright", "credits" or "license" for more information.
 4 - >>> import sys
 5 - >>> sys.getdefaultencoding()
 6 - 'utf-8'
 7 - >>> u'èéòç'
 8 - u'\xc3\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
 9 - >>> u'èéòç'.encode('utf-8')
10 - '\xc3\xa8\xc3\xa9\xc3\xb2\xc3\xa7'
11 - >>> u'è'
12 - u'\xe8'
13 - >>> u'è'.encode()
14 - '\xc3\xa8'

So explicit encoding works independently of the input encoding (or so it seems to me, but I've been stuck on this for days, so maybe I've confused myself).
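
The effect of the terminal setting can be reproduced in isolation. The sketch below is written for Python 3 (where `bytes` plays the role of Python 2's `str`, and `str` the role of `unicode`), so it is an illustration rather than a replay of the 2.4 session above. It assumes the terminal sends `\xe8` for "è" under ISO-8859-1 and `\xc3\xa8` under UTF-8, and shows what happens when the same bytes are decoded with the right and the wrong codec. Line 8 of the hint looks like the last case, although that is only a guess from the output shown.

```python
# Bytes a terminal would send for the single character "è".
latin1_input = b"\xe8"      # terminal set to ISO-8859-1
utf8_input = b"\xc3\xa8"    # terminal set to UTF-8

# Decoding with the codec that matches the terminal gives one character.
right_latin1 = latin1_input.decode("latin-1")
right_utf8 = utf8_input.decode("utf-8")

# Decoding UTF-8 bytes as Latin-1 maps every byte to its own character.
wrong = utf8_input.decode("latin-1")

print(ascii(right_latin1))  # '\xe8'
print(ascii(right_utf8))    # '\xe8'
print(ascii(wrong))         # '\xc3\xa8'

# The opposite mismatch does not even decode.
try:
    latin1_input.decode("utf-8")
    latin1_as_utf8_ok = True
except UnicodeDecodeError:
    latin1_as_utf8_ok = False
print(latin1_as_utf8_ok)    # False
```

Once the text has been decoded correctly, an explicit `.encode('utf-8')` always gives the same bytes, which matches lines 9-10 being identical in both sessions.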

WHERE IS THE SOLUTION??

Comparing line 8 of the background and the hint, you can see that the unicode objects that get created differ. So I started thinking about it. What have I concluded? Nothing. Nothing except that maybe my encoding problems lie in the file's encoding when I save my .py (which contains all the UTF-8 characters that have to be inserted into the HTML document).

THE "REAL" CODE

The code does nothing special: it opens an HTML template, reads it into a string, replaces placeholders with unicode (UTF-8 encoded? I hope so) strings, and saves the result to another file that will be viewed over the Internet (yes, my "landing" page declares UTF-8 in its header). I don't have the code here because it is scattered across several files, but I'm sure of the program's workflow (I traced it).

FINAL QUESTION

In light of this, does anybody have an idea for making my code work? Ideas about Unix file encoding? Or .py file encoding? How can I change the encoding to make my code work?

LAST HINT

Before substituting the placeholders with the UTF-8 object, if I insert a

utf8Obj.encode('latin-1')

my document displays perfectly on the Internet!
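
One possible reading of this hint, offered only as an assumption, is that something in the chain (the template file, the server, or the browser) is treating the page as Latin-1 even though the header declares UTF-8. The Python 3 sketch below shows what UTF-8 bytes look like to a Latin-1 reader, and also that Latin-1 cannot represent the Cyrillic title from the background section, so `encode('latin-1')` would not be a general fix. The helper `can_encode` is a hypothetical name introduced for the illustration.

```python
accented = "èéòç"
title = "Бронирование номеров"

def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

# What a reader that assumes Latin-1 shows for UTF-8 encoded text.
seen_as_latin1 = accented.encode("utf-8").decode("latin-1")
print(seen_as_latin1)                   # Ã¨Ã©Ã²Ã§

# Latin-1 covers the accented letters but not Cyrillic; UTF-8 covers both.
print(can_encode(accented, "latin-1"))  # True
print(can_encode(title, "latin-1"))     # False
print(can_encode(title, "utf-8"))       # True
```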

Thanks to those who answer.

EDIT1 - DEVELOPMENT WORKFLOW

OK, this is my development workflow:

I have a CVS repository for the project. The project is located on a CentOS server, a 64-bit machine. I develop my code on Windows 7 (64-bit) with Eclipse. Every modification is committed ONLY with CVS commit. The code is executed on the CentOS machine, which uses this Python:

Python 2.4.6 (#1, Jan 27 2012, 15:41:03)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-51)] on linux2

I set Eclipse up this way: PREFERENCES -> GENERAL -> WORKSPACE -> TEXT FILE ENCODING : UTF-8

A Zope/Plone application runs on the same server: it serves some PHP pages. The PHP pages call some Python methods (application logic) via WS located on the Zope/Plone "server". That server interfaces directly with the application logic.

That's all.

EDIT2

This is the function that does the replace:

    def _fillTemplate(self, buf):
        """_fillTemplate(buf)-->str
        Returns the document with its fields replaced from dict_template.
        """
        try:
            for k, v in self.dict_template.iteritems():
                if not isinstance(v, unicode):
                    v = str(v)
                else:
                    v = v.encode('latin-1')  # This way it works, but why?
                buf = buf.replace(k, v)
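
A common way to avoid mixing encodings in a function like this is to decode the template once, do every replacement on text, and encode once on output. The following is a minimal Python 3 sketch of that idea, not the poster's code: the name `fill_template`, its parameters, and the `_COUNT_` placeholder are hypothetical, and it assumes the template file really is UTF-8. The first line is a PEP 263 source encoding declaration, which is what tells Python 2 how to decode non-ASCII literals in a .py file (Python 3 already defaults to UTF-8).

```python
# -*- coding: utf-8 -*-

def fill_template(template_bytes, replacements,
                  template_encoding="utf-8", output_encoding="utf-8"):
    """Decode once, replace on text, encode once (hypothetical helper)."""
    text = template_bytes.decode(template_encoding)
    for key, value in replacements.items():
        if isinstance(value, bytes):
            # Assumption: byte values use the same encoding as the template.
            value = value.decode(template_encoding)
        elif not isinstance(value, str):
            value = str(value)
        text = text.replace(key, value)
    return text.encode(output_encoding)

# Example template, held as bytes the way it would be read from disk.
template = "<title>_MAIL_TITLE_</title><p>_COUNT_ rooms</p>".encode("utf-8")
page = fill_template(template, {
    "_MAIL_TITLE_": "Бронирование номеров",
    "_COUNT_": 3,
})
print(page.decode("utf-8"))
```

With this shape, the only place an encoding name appears is at the boundary where bytes enter or leave the program, so the declared charset of the page and the actual bytes written are easier to keep in agreement.
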

DonCallisto asked Apr 26 '26 10:04

1 Answer

While you answer my comment, here is the answer to the first question:

Take a look at lines [7-10]. Isn't that weird? If my Python has a default encoding of utf-8 (line 6), why does it convert the string on line 7 differently than line 9 does? Now take a look at lines [11-14] and their output.

No, it's not weird: you must distinguish between Python encoding, shell encoding, system encoding, file encoding, declared file encoding and applied encoding. That makes a lot of encodings, doesn't it?

sys.getdefaultencoding()

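This gives you the default encoding Python uses for implicit conversions between byte strings and unicode objects (for example, calling encode() with no argument). It has nothing to do with how terminal input is read or how output is displayed.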

In [7]: u'è'
Out[7]: u'\xe8'
In [8]: u'è'.encode('utf8')
Out[8]: '\xc3\xa8'
In [9]: print u'è'
è
In [10]: print u'è'.encode('utf8')
è

When you use print, the character is printed to the screen; if you don't, Python gives you a representation that you can copy and paste to obtain the same data.

Since a unicode string is not the same as a utf8 string, it doesn't give you the same data.

Unicode is a "neutral" representation of the string, while utf8 is an encoded one.
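
The same distinction can be checked in Python 3, shown here as a small sketch rather than a replay of the Python 2 session above. In Python 3, `str` is the "neutral" unicode type and `bytes` is the encoded one, and `ascii()` gives the escaped representation that Python 2's interpreter echoed for unicode objects.

```python
text = "è"                      # one Unicode character, code point U+00E8
encoded = text.encode("utf-8")  # its UTF-8 form, two bytes

# The escaped representation shows the code point, not any encoding.
print(ascii(text))      # '\xe8'
print(hex(ord(text)))   # 0xe8

# The encoded form is a different object with different contents.
print(repr(encoded))    # b'\xc3\xa8'
print(len(text), len(encoded))  # 1 2

# Decoding with the matching codec gets the original text back.
print(encoded.decode("utf-8") == text)  # True
```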

e-satis answered Apr 29 '26 00:04