Set encoding in Python 3 CGI scripts


When writing a Python 3.1 CGI script, I run into horrible UnicodeDecodeErrors. However, when running the script on the command line, everything works.

It seems that open() and print() use the return value of locale.getpreferredencoding() to decide which encoding to use by default. When running on the command line, that value is 'UTF-8', as it should be. But when the script is run through a browser, the encoding mysteriously becomes 'ANSI_X3.4-1968', which appears to be just a fancy name for plain ASCII.

I now need to know how to make the CGI script run with 'utf-8' as the default encoding in all cases. My setup is Python 3.1.3 and Apache2 on Debian Linux. The system-wide locale is en_GB.utf-8.


asked Feb 17 '12 03:02

jforberg

2 Answers

Answering this for late-comers because I don't think that the posted answers get to the root of the problem, which is the lack of locale environment variables in a CGI context. I'm using Python 3.2.

open() opens a file object in text (string) or binary (bytes) mode for reading and/or writing. In text mode, the encoding used to encode strings written to the file, and to decode bytes read from it, may be specified in the call. If it isn't, it is determined by locale.getpreferredencoding(), which on Linux uses the encoding from your locale environment settings, normally UTF-8 (from e.g. LANG=en_US.UTF-8):
```
>>> f = open('foo', 'w')         # open file for writing in text mode
>>> f.encoding
'UTF-8'                          # encoding is from the environment
>>> f.write('€')                 # write a Unicode string
1
>>> f.close()
>>> exit()
user@host:~$ hd foo
00000000  e2 82 ac      |...|    # data is UTF-8 encoded
```
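For file access specifically, passing the encoding in the call sidesteps the locale entirely. The following is a minimal sketch of that idea; the helper names `write_text_utf8` and `read_raw` and the temporary file path are made up for illustration and are not part of the original answer.

```python
import os
import tempfile


def write_text_utf8(path, text):
    """Write text to path as UTF-8, regardless of the process locale."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


def read_raw(path):
    """Return the file's contents as undecoded bytes."""
    with open(path, "rb") as f:
        return f.read()


# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "foo")
write_text_utf8(path, "€")
print(read_raw(path))  # b'\xe2\x82\xac'
```

This only helps for files the script opens itself; sys.stdout is already open by the time the script starts, which is why the rest of this answer matters.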
sys.stdout is in fact a file opened for writing in text mode, with an encoding based on locale.getpreferredencoding(). You can write strings to it just fine, and they'll be encoded to bytes using sys.stdout's encoding. print() writes to sys.stdout by default; print() itself has no encoding, rather it's the file it writes to that has one:
```
>>> import sys
>>> sys.stdout.encoding
'UTF-8'                          # encoding is from the environment
>>> exit()
user@host:~$ python3 -c 'print("€")' > foo
user@host:~$ hd foo
00000000  e2 82 ac 0a   |....|   # data is UTF-8 encoded; \n is from print()
```
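To see that the encoding belongs to the stream rather than to print(), the same string can be printed to two in-memory text streams that differ only in their encoding. This is an illustrative sketch; `encode_via_print` is a hypothetical helper, and the ASCII case reproduces the kind of failure a CGI process runs into.

```python
import io


def encode_via_print(text, encoding):
    """Print text to an in-memory text stream and return the bytes produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="\n")
    print(text, file=stream)
    stream.flush()
    return raw.getvalue()


print(encode_via_print("€", "utf-8"))  # b'\xe2\x82\xac\n'

try:
    encode_via_print("€", "ascii")
except UnicodeEncodeError as exc:
    # An ASCII stream cannot represent the euro sign.
    print("ascii stream failed:", exc.reason)
```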
You cannot write bytes to sys.stdout directly; use sys.stdout.buffer.write() for that. If you try to write bytes using sys.stdout.write(), it raises a TypeError. If you try using print(), then print() simply turns the bytes object into its string representation, so an escape sequence like \xff is written as the four characters \, x, f, f:
```
user@host:~$ python3 -c 'print(b"\xe2\xf82\xac")' > foo
user@host:~$ hd foo
00000000  62 27 5c 78 65 32 5c 78  66 38 32 5c 78 61 63 27  |b'\xe2\xf82\xac'|
00000010  0a                                                |.|
```
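For comparison, here is a small sketch of writing raw bytes through the binary layer underneath a text stream. `write_bytes` is a hypothetical helper, and the demonstration uses an in-memory stand-in for sys.stdout so that the resulting bytes can be inspected.

```python
import io
import sys


def write_bytes(data, stream=None):
    """Write raw bytes to the binary buffer underneath a text stream."""
    stream = sys.stdout if stream is None else stream
    stream.flush()             # flush pending text so output stays in order
    stream.buffer.write(data)  # bypasses the text layer and its encoding
    stream.buffer.flush()


# In-memory stand-in for an ASCII-only sys.stdout.
fake_stdout = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
write_bytes("€\n".encode("utf-8"), fake_stdout)
print(fake_stdout.buffer.getvalue())  # b'\xe2\x82\xac\n'
```

In a real script, calling `write_bytes(data)` without a second argument would target sys.stdout. Because the bytes never pass through the text layer, the stream's encoding does not matter.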
In a CGI script you need to write to sys.stdout, and you can use print() to do it. But a CGI script process in Apache has no locale environment settings, because they are not part of the CGI specification. The sys.stdout encoding therefore defaults to ANSI_X3.4-1968, in other words ASCII. If you try to print() a string that contains non-ASCII characters to sys.stdout, you'll get "UnicodeEncodeError: 'ascii' codec can't encode character...: ordinal not in range(128)".

A simple solution is to pass the Apache process's LANG environment variable through to the CGI script, using Apache's mod_env PassEnv directive in the server or virtual host configuration: PassEnv LANG. On Debian/Ubuntu, make sure that in /etc/apache2/envvars you have uncommented the line ". /etc/default/locale", so that Apache runs with the system default locale and not the C (POSIX) locale, which also uses ASCII encoding. The following CGI script should then run without errors in Python 3.2:
```
#!/usr/bin/env python3
import sys

print('Content-Type: text/html; charset=utf-8')
print()
print('<html><body><pre>' + sys.stdout.encoding + '</pre>h€lló wörld</body></html>')
```
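To check whether the locale variables actually reach the CGI process, a diagnostic script along these lines may help. It is only a sketch: `encoding_report` is a hypothetical helper, and the output is kept pure ASCII so that the script itself cannot fail on an ASCII-only sys.stdout.

```python
#!/usr/bin/env python3
import locale
import os
import sys


def encoding_report(environ=None):
    """Return lines describing the locale variables and resulting encodings."""
    environ = os.environ if environ is None else environ
    lines = []
    for name in ("LANG", "LC_ALL", "LC_CTYPE"):
        # !a forces an ASCII-only representation of the value.
        lines.append("{} = {!a}".format(name, environ.get(name)))
    lines.append("preferred encoding = " + locale.getpreferredencoding())
    lines.append("sys.stdout.encoding = " + str(sys.stdout.encoding))
    return lines


print("Content-Type: text/plain; charset=us-ascii")
print()
print("\n".join(encoding_report()))
```

If LANG shows up as None when the script is requested through Apache, the variable is not being passed through to the CGI environment.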

answered Oct 09 '22 01:10

cercatrova

I solved my problem with the following code:

    import locale                                   # Ensures that subsequent open()s
    locale.getpreferredencoding = lambda: 'UTF-8'   # are UTF-8 encoded.

    import sys
    sys.stdin = open('/dev/stdin', 'r')             # Re-open standard files in UTF-8
    sys.stdout = open('/dev/stdout', 'w')           # mode.
    sys.stderr = open('/dev/stderr', 'w')

This solution is not pretty, but it seems to work for the time being. I actually chose Python 3 over the more commonplace version 2.6 as my development platform because of its advertised good Unicode handling, but the cgi package seems to ruin some of that simplicity.

I'm led to believe that the /dev/std* files may not exist on older systems that do not have a procfs. They are supported on recent Linuxes, however.
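A possible alternative that does not depend on the /dev/std* files is to wrap the existing binary buffers in new text wrappers. This is a sketch rather than what I actually deployed; `rewrap_utf8` is a made-up helper name, and the demonstration uses an in-memory stream in place of the real sys.stdout.

```python
import io
import sys


def rewrap_utf8(stream, **kwargs):
    """Return a new UTF-8 text wrapper around the stream's binary buffer.

    The old wrapper is detached and must not be used afterwards.
    """
    return io.TextIOWrapper(stream.detach(), encoding="utf-8", **kwargs)


# In a CGI script this would be applied to the real streams, e.g.:
#     sys.stdout = rewrap_utf8(sys.stdout)
#     sys.stdin = rewrap_utf8(sys.stdin)

# Demonstration with an in-memory stand-in for an ASCII-only stdout.
demo = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
demo = rewrap_utf8(demo)
print("h€lló wörld", file=demo)
demo.flush()
print(demo.buffer.getvalue())
```

On Python 3.7 and later, sys.stdout.reconfigure(encoding='utf-8') achieves much the same thing without replacing the stream object.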

answered Oct 09 '22 02:10

jforberg
