I have a Python script that loads a web page using urllib2.urlopen, does some various magic, and spits out the results using print. We then run the program on Windows like so:
python program.py > output.htm
Here's the problem:
The urlopen call reads data from an IIS web server which outputs UTF-8. The script spits out this same data to the output, but certain characters (such as the long hyphen that Word always inserts for you against your will because it's smarter than you) get garbled and end up like – instead.
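A small sketch of how this kind of garbling arises, assuming the character in question is an en dash (U+2013) and that the viewer reads the UTF-8 bytes as Windows-1252 (which is how browsers generally treat a page labelled ISO-8859-1). The variable names are only illustrative.

```python
# One en dash becomes three bytes in UTF-8.
EN_DASH = u"\u2013"
utf8_bytes = EN_DASH.encode("utf-8")

# Reading those three bytes with the wrong charset yields three
# unrelated characters instead of one dash.
misread = utf8_bytes.decode("cp1252")

# The bytes themselves are intact, so the damage is reversible:
# re-encode with the wrong charset, then decode as UTF-8.
repaired = misread.encode("cp1252").decode("utf-8")

print(repr(utf8_bytes))
print(len(misread))
print(repaired == EN_DASH)
```

The point is that nothing is lost in the file itself; only the interpretation of the bytes is wrong.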
Upon further investigation, I noticed that even though the web server spits out UTF-8 data, the output.htm file is encoded with the ISO-8859-1 character set.
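My questions:
When you pipe a Python program to an output file on Windows, does it always use this character set?
If not, is there a workaround? I could pass output.htm as a command line parameter and write to that file instead of the screen, but I'd have to redo a whole bunch of logic in my program.
Thanks for any help!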
UPDATE:
At the top of output.htm I added:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
However, it makes no difference. The characters are still garbled. If I manually switch over to UTF-8 in Firefox, the file displays correctly. Both IE and FF think this file is Western ISO even though it is clearly not.
From your comments and question update it seems that the data is correctly encoded in UTF-8. This means you just need to tell your browser it's UTF-8, either by using a BOM, or better, by adding encoding information to your HTML document:
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
You really shouldn't use an XML declaration if the document is not valid XML.
The best and most reliable way would be to serve the file via HTTP and set the Content-Type: header appropriately.
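As a rough illustration of the HTTP route, here is a minimal WSGI sketch using the standard library's wsgiref module. The names `PAGE`, `app` and `serve`, the port, and the sample markup are all assumptions made up for the example; in practice the body would be the bytes of output.htm.

```python
from wsgiref.simple_server import make_server

# Stand-in for the UTF-8 bytes of the generated page (hypothetical content).
PAGE = u"<html><body>caf\u00e9 \u2013 menu</body></html>".encode("utf-8")

def app(environ, start_response):
    """WSGI application that declares the charset in the HTTP header."""
    headers = [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(PAGE))),
    ]
    start_response("200 OK", headers)
    return [PAGE]

def serve(port=8000):
    """Serve the page locally until interrupted (not called automatically)."""
    make_server("127.0.0.1", port, app).serve_forever()
```

Calling `serve()` and opening http://127.0.0.1:8000/ should let the browser pick UTF-8 from the header, without relying on a meta tag or on guessing.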
When you pipe a Python program to an output file on Windows, does it always use this character set?
When output goes to a pipe or a file, Python 2 encodes unicode strings with its default encoding. On my machine:
In [5]: sys.getdefaultencoding()
Out[5]: 'ascii'
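To make the consequence of an 'ascii' default concrete, here is a small sketch. The helper names `describe_stdout` and `can_encode` are hypothetical. In Python 2, a unicode string printed to a redirected stdout is encoded with the default encoding, so a character like the en dash fails under ASCII, while byte strings pass through untouched.

```python
import sys

def describe_stdout():
    """Report the encodings Python would use for output."""
    return {
        # Often None in Python 2 when stdout is redirected to a file or pipe.
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
        "default_encoding": sys.getdefaultencoding(),
    }

def can_encode(text, encoding):
    """Return True if every character in text fits in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False

print(describe_stdout())
print(can_encode(u"\u2013", "ascii"))
print(can_encode(u"\u2013", "utf-8"))
```

Running this once normally and once with `> out.txt` shows whether the reported stdout encoding changes on redirection.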
If not, is there a workaround?
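import sys
try:
    # Not part of a stock CPython 2 build, so this normally raises AttributeError.
    sys.setappdefaultencoding('utf-8')
except AttributeError:
    # site.py removes setdefaultencoding at startup; reload(sys) restores it (Python 2 only).
    reload(sys)
    sys.setdefaultencoding('utf-8')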
Now all unicode output is implicitly encoded as 'utf-8'.
I think the correct way to handle this situation without having to "redo a whole bunch of logic" is to decode all data from your internet source from the server or page encoding to unicode, and then to use the workaround shown above to set the default encoding to utf-8.
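For comparison, the decode-then-encode idea can also be sketched with explicit encoding at the output, which avoids changing the interpreter-wide default. This is a swapped-in variant, not the workaround described in this answer. The helper names `decode_page` and `write_utf8` are hypothetical, the sample bytes stand in for what urlopen(...).read() returns, and the page is assumed to really be UTF-8.

```python
import io
import sys

def decode_page(raw_bytes, source_encoding="utf-8"):
    """Turn the bytes read from the server into a unicode string."""
    return raw_bytes.decode(source_encoding)

def write_utf8(text, stream=None):
    """Encode text as UTF-8 and write the bytes to a binary stream."""
    if stream is None:
        # Python 3 exposes the binary layer as sys.stdout.buffer;
        # Python 2's sys.stdout already accepts byte strings.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write(text.encode("utf-8"))

# Stand-in for the bytes returned by urlopen(...).read() (hypothetical data).
raw = b"caf\xc3\xa9 \xe2\x80\x93 menu"
page = decode_page(raw)

# Writing to an in-memory stream here; pass no stream to write to stdout.
sink = io.BytesIO()
write_utf8(page, sink)
```

With this shape, only the final write needs to change in the program, and the bytes that land in output.htm are UTF-8 regardless of the console or default encoding.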