Parsing an XML file raises UnicodeEncodeError (ElementTree) / ValueError (lxml)

I send a GET request to the CareerBuilder API:

import requests

url = "http://api.careerbuilder.com/v1/jobsearch"
payload = {'DeveloperKey': 'MY_DEVLOPER_KEY',
           'JobTitle': 'Biologist'}
r = requests.get(url, params=payload)
xml = r.text

and get back an XML document. However, I have trouble parsing it.

Using either lxml

>>> from lxml import etree
>>> print etree.fromstring(xml)

Traceback (most recent call last):
  File "<pyshell#4>", line 1, in <module>
    print etree.fromstring(xml)
  File "lxml.etree.pyx", line 2992, in lxml.etree.fromstring (src\lxml\lxml.etree.c:62311)
  File "parser.pxi", line 1585, in lxml.etree._parseMemoryDocument (src\lxml\lxml.etree.c:91625)
ValueError: Unicode strings with encoding declaration are not supported.

or ElementTree:

Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print ET.fromstring(xml)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1301, in XML
    parser.feed(text)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1641, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 3717: ordinal not in range(128)

So, even though the XML file starts with

<?xml version="1.0" encoding="UTF-8"?>

I have the impression that it contains characters that are not allowed. How do I parse this file with either lxml or ElementTree?

BioGeek asked Mar 25 '13 18:03



1 Answer

You are passing the parser the decoded Unicode value (r.text). Use the raw response data, r.raw, instead:

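r = requests.get(url, params=payload, stream=True)
r.raw.decode_content = True  # transparently decompress gzip/deflate responses
tree = etree.parse(r.raw)    # the parser reads bytes directly from the response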

which will read the data from the response directly; do note the stream=True option to .get().

Setting the r.raw.decode_content = True flag ensures that the raw socket will give you the decompressed content even if the response is gzip or deflate compressed.
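The same pattern can be tried without a network connection. This is only a minimal sketch: io.BytesIO is a stand-in for r.raw, the sample document is invented for illustration, and the standard library's ElementTree is used in place of lxml so that nothing needs to be installed.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for r.raw: a binary file-like object holding the
# UTF-8 encoded document (the sample content is invented for illustration).
sample = u'<?xml version="1.0" encoding="UTF-8"?><jobs><job>Biologist</job></jobs>'
raw = io.BytesIO(sample.encode("utf-8"))

# The parser pulls bytes from the stream and decodes them itself,
# following the encoding declaration at the top of the document.
tree = ET.parse(raw)
root = tree.getroot()
jobs = [job.text for job in root.findall("job")]
```

Any object with a binary read() method can be handed to parse() this way, which is why the raw response works once decode_content is set.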

You don't have to stream the response; for smaller XML documents it is fine to use the response.content attribute, which is the un-decoded response body:

r = requests.get(url, params=payload)
xml = etree.fromstring(r.content)

XML parsers expect bytes as input, because the XML format itself (through the encoding declaration) dictates how the parser is to decode those bytes to Unicode text.
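As a small illustration of parsing bytes, here is a sketch using the standard library's ElementTree. The xml_bytes value is a hypothetical stand-in for r.content; it contains a non-breaking space (U+00A0), the character named in the question's traceback.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for r.content: UTF-8 bytes that carry an encoding
# declaration and a non-breaking space (U+00A0) inside the text.
xml_text = u'<?xml version="1.0" encoding="UTF-8"?><job><title>Biologist\u00a0II</title></job>'
xml_bytes = xml_text.encode("utf-8")

# Passing bytes lets the parser apply the declared encoding itself.
root = ET.fromstring(xml_bytes)
title = root.find("title").text
```

The non-ASCII character survives intact in title, because the parser decoded the bytes itself rather than receiving text that had already been decoded.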

Martijn Pieters answered Oct 05 '22 23:10
