I have encoding problems when serving a simple web page in Python 3, using BaseHTTPRequestHandler.
Here is a working example:
    #!/usr/bin/python3
    # -*- coding: utf-8 -*-
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from os import curdir, sep, remove

    HTML_FILE_NAME = 'test.html'
    PORT_NUMBER = 8080

    # This class handles any incoming request from the browser
    class myHandler(BaseHTTPRequestHandler):
        # Handler for the GET requests
        def do_GET(self):
            self.path = HTML_FILE_NAME
            try:
                with open(curdir + sep + self.path, 'r') as f:
                    self.send_response(200)
                    self.send_header('Content-type', 'text/html')
                    self.end_headers()
                    self.wfile.write(bytes(f.read(), 'UTF-8'))
                return
            except IOError:
                self.send_error(404, 'File Not Found: %s' % self.path)

    try:
        # Write the test page, then create a web server and define the handler to manage the incoming request
        with open(HTML_FILE_NAME, 'w') as f:
            f.write('<!DOCTYPE html><html><body> <p> My name is Jérôme </p> </body></html>')
        server = HTTPServer(('', PORT_NUMBER), myHandler)
        print('Started httpserver on port %i.' % PORT_NUMBER)
        # Wait forever for incoming http requests
        server.serve_forever()
    except KeyboardInterrupt:
        print('Interrupted by the user - shutting down the web server.')
        server.socket.close()
        remove(HTML_FILE_NAME)
The expected result is to serve a web page displaying My name is Jérôme.
Instead, I have: My name is Jérôme
As you can see, the HTML page is correctly encoded with self.wfile.write(bytes(f.read(), 'UTF-8')), so I think the problem comes from the web server.
How can I tell the web server to serve the page in UTF-8?
Set Python's standard I/O encoding to UTF-8. This applies the fix for the current shell session only: $ export PYTHONIOENCODING=utf8. Note that this variable only affects stdin, stdout and stderr, not files opened with open(). To make UTF-8 the system's default locale encoding, set the environment variables in /etc/default/locale, for example LANG="en_US.UTF-8", LC_ALL="en_US.UTF-8" and LC_CTYPE="en_US.UTF-8".
In Python 3, you can URL-encode any string using the quote() function provided by the urllib.parse module. quote() uses the UTF-8 encoding scheme by default. Note that quote() considers the / character safe by default, which means it doesn't encode /.
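As a small illustration of the quote() behaviour described above (the path '/people/Jérôme' is just a made-up example, not something from the question):

```python
from urllib.parse import quote

# Hypothetical path used only for illustration.
path = '/people/Jérôme'

# By default '/' is treated as safe and non-ASCII characters are
# percent-encoded from their UTF-8 bytes.
encoded_default = quote(path)

# Passing safe='' makes quote() percent-encode '/' as well.
encoded_strict = quote(path, safe='')

print(encoded_default)
print(encoded_strict)
```

Keep in mind this is about encoding URLs, which is a separate matter from declaring the charset of a response body.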
In Python 3, UTF-8 is the default source encoding. When the encoding is not set up correctly, it is common to see an error such as "UnicodeEncodeError: 'ascii' codec can't encode character ...", because printing a string uses the encoding of the output stream. Check sys.stdout.encoding to see which encoding is in use.
The encode() method encodes the string, using the specified encoding. If no encoding is specified, UTF-8 will be used.
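A quick sketch of both points, using the name from the question as sample text: encode() with no argument gives UTF-8 bytes, while forcing the ASCII codec raises the kind of error quoted above.

```python
name = 'Jérôme'

# With no argument, str.encode() uses UTF-8.
utf8_default = name.encode()
utf8_explicit = name.encode('utf-8')

# The ASCII codec cannot represent 'é' or 'ô', so this raises.
try:
    name.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    ascii_error = str(exc)
    print(ascii_error)
```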
No problem if I add:
<meta content="text/html;charset=utf-8" http-equiv="Content-Type">
<meta content="utf-8" http-equiv="encoding">
in my html head.
Your web server is already sending the text encoded as UTF-8, but you need to tell your browser the encoding of the bytes it receives. The original HTTP/1.1 spec (RFC 2616) declared ISO-8859-1 as the default charset for text types.
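You can reproduce the garbled output without a browser. This is only a sketch of what is presumably happening on the client side: the UTF-8 bytes are fine, but they get decoded as ISO-8859-1.

```python
text = 'My name is Jérôme'

# What the server puts on the wire.
sent = text.encode('utf-8')

# What a client sees if it assumes ISO-8859-1 for those bytes.
misread = sent.decode('iso-8859-1')

# What it sees once it knows the bytes are UTF-8.
correct = sent.decode('utf-8')

print(misread)
print(correct)
```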
The standard HTTP way of doing this is to tag the Content-type header value with a charset parameter.
Therefore, you should change your code to read:
self.send_header('Content-type', 'text/html; charset=utf-8')
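For reference, here is a minimal self-contained sketch of a handler with that header in place. The names Utf8Handler, PAGE and make_server are mine, not from the question, and the Content-Length header and silenced logging are optional extras I added for the sketch.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Same page as in the question, kept in memory for simplicity.
PAGE = '<!DOCTYPE html><html><body> <p> My name is Jérôme </p> </body></html>'


class Utf8Handler(BaseHTTPRequestHandler):
    """Hypothetical handler that declares the charset of its response."""

    def do_GET(self):
        body = PAGE.encode('utf-8')
        self.send_response(200)
        # The charset parameter tells the browser how to decode the body.
        self.send_header('Content-type', 'text/html; charset=utf-8')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the console quiet; remove this method to restore request logging.
        pass


def make_server(port=8080):
    # Port 8080 matches the question; pass 0 to let the OS pick a free port.
    return HTTPServer(('', port), Utf8Handler)
```

Calling make_server().serve_forever() should then serve the page with the name displayed correctly, assuming the browser honours the header.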
Also, watch out for the encoding of your HTML file. Without an encoding given to open(), it'll be guessed based on your locale. This won't break anything, unless you end up running this script where the locale is C, POSIX or non-latin Windows.
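One way to take the locale out of the picture, as a sketch: pass encoding='utf-8' when writing the file, and read it back in binary mode so the bytes go to the client untouched. The helper names write_page and read_page_bytes are made up for this example.

```python
import os
import tempfile


def write_page(path, html):
    # An explicit encoding means the result no longer depends on the locale.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(html)


def read_page_bytes(path):
    # Binary mode returns the stored bytes as-is, ready for wfile.write().
    with open(path, 'rb') as f:
        return f.read()


# Round trip in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as tmp:
    page_path = os.path.join(tmp, 'test.html')
    write_page(page_path, '<p> My name is Jérôme </p>')
    page_bytes = read_page_bytes(page_path)

print(page_bytes)
```

Reading in binary also avoids the decode and re-encode step that bytes(f.read(), 'UTF-8') performs.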