
Peter Piper piped a Python program - and lost all his unicode characters

I have a Python script that loads a web page using urllib2.urlopen, does some magic, and spits out the results using print. We then run the program on Windows like so:

python program.py > output.htm

Here's the problem:

The urlopen reads data from an IIS web server which outputs UTF8. It spits out this same data to the output, however certain characters (such as the long hyphen that Word always inserts for you against your will because it's smarter than you) get garbled and end up like – instead.

Upon further investigation, I noticed that even though the web server spits out UTF-8 data, the output.htm file is encoded with the ISO-8859-1 character set.

My questions:

  1. When you redirect a Python program to an output file on Windows, does it always use this character set?
  2. If so, is there any way to change that behavior?
  3. If not, is there a workaround? I suppose I could just pass in output.htm as a command line parameter and write to that file instead of the screen, but I'd have to redo a whole bunch of logic in my program.

Thanks for any help!

UPDATE:

At the top of output.htm I added:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">

However, it makes no difference. The characters are still garbled. If I manually switch over to UTF-8 in Firefox, the file displays correctly. Both IE and FF think this file is Western ISO even though it is clearly not.

Mike Christensen asked Jan 06 '12 16:01


2 Answers

From your comments and question update it seems that the data is correctly encoded in UTF-8. This means you just need to tell your browser it's UTF-8, either by using a BOM, or better, by adding encoding information to your HTML document:

<head>
  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
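For the BOM option mentioned above, here is a minimal sketch of how a script could write the file itself with a UTF-8 byte order mark. The function names and the file path are hypothetical, and it assumes the text is already a unicode string; Python's "utf-8-sig" codec prepends the BOM on write.

```python
import codecs
import io


def write_html_with_bom(path, text):
    """Write unicode text as UTF-8, preceded by a byte order mark."""
    # The "utf-8-sig" codec emits the bytes EF BB BF before the content.
    with io.open(path, "w", encoding="utf-8-sig") as handle:
        handle.write(text)


def starts_with_bom(path):
    """Return True if the file begins with the UTF-8 byte order mark."""
    with io.open(path, "rb") as handle:
        return handle.read(3) == codecs.BOM_UTF8
```

This only helps when the script writes the file directly; it does not change what a shell redirect produces.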

You really shouldn't use an XML declaration if the document is not valid XML.

The best and most reliable way would be to serve the file via HTTP and set the Content-Type: header appropriately.
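To illustrate why correctly encoded UTF-8 can still look garbled, here is a small sketch that encodes a dash as UTF-8 and then decodes the bytes with a Western single-byte codec. The helper name is hypothetical, and the use of cp1252 is an assumption: browsers commonly treat "Western ISO" as Windows-1252.

```python
def misread_utf8(text, wrong_encoding="cp1252"):
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_encoding)


en_dash = u"\u2013"
garbled = misread_utf8(en_dash)

# One character turns into three
print([hex(ord(ch)) for ch in garbled])  # prints ['0xe2', '0x20ac', '0x201c']

# The bytes themselves are intact: reading them as UTF-8 restores the dash
assert garbled.encode("cp1252").decode("utf-8") == en_dash
```

The data is never damaged here; only the interpretation differs, which is why declaring the charset is enough to fix the display.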

Niklas B. answered Oct 20 '22 00:10


When you pipe a Python program to an output file on Windows, does it always use this character set?

When output goes to a pipe, the default encoding is used. On my machine:

In [5]: sys.getdefaultencoding()
Out[5]: 'ascii'

If not, is there a workaround?

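import sys
try:
    # Not every interpreter provides this function
    sys.setappdefaultencoding('utf-8')
except AttributeError:
    # site.py removes sys.setdefaultencoding at startup; reload(sys) restores it (Python 2 only)
    sys = reload(sys)
    sys.setdefaultencoding('utf-8')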

Now all output is encoded as 'utf-8'.

I think the correct way to handle this situation without having to "redo a whole bunch of logic" is to decode all data from your internet source from the server or page encoding to unicode, and then to use the workaround shown above to set the default encoding to utf-8.
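A rough sketch of the decode step described here, paired with a different second step: it encodes explicitly on output instead of changing the default encoding. This is not the method from this answer. The function names are hypothetical, and the source encoding is assumed to be UTF-8, as stated in the question.

```python
import sys


def decode_page(raw_bytes, source_encoding="utf-8"):
    """Turn the bytes read from the server into a unicode string."""
    return raw_bytes.decode(source_encoding)


def write_utf8(text, stream=None):
    """Write unicode text as UTF-8 bytes, bypassing the default encoding."""
    if stream is None:
        # Python 3 exposes the byte stream as sys.stdout.buffer;
        # on Python 2, sys.stdout already accepts byte strings.
        stream = getattr(sys.stdout, "buffer", sys.stdout)
    stream.write(text.encode("utf-8"))
```

With helpers like these, a call such as write_utf8(decode_page(data)) could stand in for print, so the redirect to output.htm would receive UTF-8 bytes regardless of the interpreter's default.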

reclosedev answered Oct 19 '22 22:10