
Setting the encoding for sax parser in Python

When I feed a UTF-8 encoded XML file to an ExpatParser instance:

import codecs
import xml.sax

def test(filename):
    parser = xml.sax.make_parser()
    # codecs.open decodes the file, so each line is a unicode string
    with codecs.open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parser.feed(line)

...I get the following:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "test.py", line 72, in search_test
    parser.feed(line)
  File "/System/Library/Frameworks/Python.framework/Versions/2.5/lib/python2.5/xml/sax/expatreader.py", line 207, in feed
    self._parser.Parse(data, isFinal)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb4' in position 29: ordinal not in range(128)

I'm probably missing something obvious here. How do I change the parser's encoding from 'ascii' to 'utf-8'?

Dan Weaver asked May 13 '09 12:05


1 Answer

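Your code fails in Python 2.6, but works in 3.0. In Python 2, codecs.open hands the parser unicode strings, and these appear to be converted back to bytes with the default 'ascii' codec before expat sees them, which is where the UnicodeEncodeError comes from.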

This does work in 2.6, presumably because it allows the parser itself to figure out the encoding (perhaps by reading the encoding optionally specified on the first line of the XML file, and otherwise defaulting to utf-8):

import xml.sax

def test(filename):
    parser = xml.sax.make_parser()
    # Hand over the raw bytes so the parser can detect the encoding itself
    parser.parse(open(filename, 'rb'))
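
If you do need to feed the parser incrementally, as in the question, one option is to feed it raw bytes instead of decoded text, so the encoding detection is still left to the parser. The following is only a minimal sketch under that assumption: TextCollector and feed_in_chunks are hypothetical names made up for this illustration, and the in-memory document stands in for a real file opened in binary mode.

```python
import io
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects all character data it receives."""

    def __init__(self):
        xml.sax.ContentHandler.__init__(self)
        self.chunks = []

    def characters(self, content):
        # The parser delivers already-decoded text here
        self.chunks.append(content)


def feed_in_chunks(stream, handler, chunk_size=4096):
    """Feed a binary stream to a SAX parser piece by piece (hypothetical helper)."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    while True:
        data = stream.read(chunk_size)  # raw bytes, not decoded text
        if not data:
            break
        parser.feed(data)
    parser.close()  # tell the parser that the document is complete


# Sample document containing the character from the traceback (u'\xb4')
xml_bytes = u'<?xml version="1.0" encoding="utf-8"?><doc>caf\xe9 \xb4</doc>'.encode('utf-8')
handler = TextCollector()
feed_in_chunks(io.BytesIO(xml_bytes), handler, chunk_size=7)
text = u''.join(handler.chunks)
```

The small chunk_size is there only to show that chunk boundaries may fall anywhere, even inside a multi-byte character. For a real file, open(filename, 'rb') would take the place of io.BytesIO.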
Stephan202 answered Nov 14 '22 23:11