
Encoding error while parsing RSS with lxml

I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError it raises.

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree   = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)
asked Apr 27 '11 23:04 by domi




2 Answers

I ran into a similar problem, and it turns out this has NOTHING to do with encodings. What's happening is that lxml is hitting a completely unrelated error: etree.parse() expects a filename, a URL, or a file-like object, not a string holding the document contents. When lxml then tries to build the error message, it chokes on the non-ASCII characters in your string and raises this confusing UnicodeDecodeError instead. It's unfortunate, and other people have commented on the same issue here:

https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html

Luckily, yours is a very easy fix. Just replace .parse with .fromstring and you should be totally good to go:

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True,recover=True,encoding=encd)

# fromstring() parses the document contents themselves;
# parse() would have treated them as a filename or URL
tree = etree.fromstring(response, parser)

Just tested this on my machine and it worked fine. Hope it helps!
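
For anyone who would rather keep using parse(), here is a minimal sketch of both routes. It assumes the feed bytes are already in memory, so FEED_BYTES is a made-up stand-in for response.read(), and the two helper names are hypothetical. The ElementTree fallback is only there so the sketch runs where lxml is not installed.

```python
import io

try:
    from lxml import etree  # the library discussed in this thread
except ImportError:
    # Fallback so the sketch still runs without lxml; the stdlib module
    # offers the same two calls used below.
    import xml.etree.ElementTree as etree

# Hypothetical sample standing in for the bytes returned by response.read().
FEED_BYTES = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<feed><title>Wiadomo\u015bci</title></feed>'
).encode("utf-8")


def parse_from_bytes(data):
    """Parse document contents that are already in memory."""
    return etree.fromstring(data)


def parse_from_file_like(data):
    """Wrap the bytes in a file-like object, which parse() accepts."""
    return etree.parse(io.BytesIO(data)).getroot()


root_a = parse_from_bytes(FEED_BYTES)
root_b = parse_from_file_like(FEED_BYTES)
title_a = root_a.find("title").text
title_b = root_b.find("title").text
```

Both calls give the same root element. The point is only that parse() wants something it can open or read from, while fromstring() wants the contents.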

answered Sep 25 '22 12:09 by Luiz Scheidegger


It's often easier to load the string and sort out its encoding yourself, and then call fromstring() on it, than to rely on lxml.etree.parse() and its difficult-to-manage encoding options.

This particular RSS file begins with an encoding declaration, so lxml can work out the encoding from the raw bytes and everything should just work:

<?xml version="1.0" encoding="utf-8"?>
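
If you want to check what a feed declares before handing it to lxml, a rough sketch like the following reads the declaration straight from the raw bytes. The function name and the regular expression are my own illustration, not part of lxml, and the sketch assumes an ASCII-compatible encoding: it ignores byte order marks and UTF-16 input, and it falls back to UTF-8 when there is no declaration.

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
ENCODING_DECL = re.compile(
    br'^<\?xml[^>]*\sencoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def declared_encoding(data, default="utf-8"):
    """Return the encoding named in the XML declaration of `data` (bytes).

    Hypothetical helper: returns `default` when no declaration is found.
    """
    match = ENCODING_DECL.match(data)
    if match is None:
        return default
    return match.group(1).decode("ascii")


utf8_feed = b'<?xml version="1.0" encoding="utf-8"?>\n<feed/>'
latin2_feed = b"<?xml version='1.0' encoding='ISO-8859-2'?>\n<feed/>"
bare_feed = b"<feed/>"

found = [declared_encoding(d) for d in (utf8_feed, latin2_feed, bare_feed)]
```

When the declaration is present and correct, there is usually no need to guess the encoding with chardet at all.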

The following code shows some of the variations you can use to make etree parse input in different encodings. You can also ask tostring() to write the output in a different encoding, which then appears in the XML declaration of the result.

import lxml.etree
import urllib2

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()
print [response]
        # ['<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\xc5\x9bci...']

uresponse = response.decode("utf8")
print [uresponse]    
        # [u'<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\u015bci...']

tree = lxml.etree.fromstring(response)
res = lxml.etree.tostring(tree)
print [res]
        # ['<feed xmlns="http://www.w3.org/2005/Atom">\n<title>Wiadomo&#347;ci...']

lres = lxml.etree.tostring(tree, encoding="latin1")
print [lres]
        # ["<?xml version='1.0' encoding='latin1'?>\n<feed xmlns=...<title>Wiadomo&#347;ci..."]


# works because the 38-character encoding declaration at the start is sliced off
print lxml.etree.fromstring(uresponse[38:])   

# throws ValueError(u'Unicode strings with encoding declaration are not supported.',)
print lxml.etree.fromstring(uresponse)

Code can be tried here: http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#
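
The [38:] slice above only works when the declaration is exactly 38 characters long. Here is a minimal sketch of two less brittle options: strip the declaration with a regular expression, or encode the text back to bytes so lxml can read the declaration itself. SAMPLE and strip_xml_declaration are illustrative names, not from the code above, and the ElementTree fallback is only there so the sketch runs where lxml is not installed.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Fallback so the sketch runs without lxml installed.
    import xml.etree.ElementTree as etree

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="utf-8"?> plus trailing whitespace.
XML_DECLARATION = re.compile(r'^<\?xml\s[^>]*\?>\s*')


def strip_xml_declaration(text):
    """Remove a leading XML declaration from already-decoded text."""
    return XML_DECLARATION.sub(u'', text, count=1)


# Hypothetical sample standing in for uresponse.
SAMPLE = (
    u'<?xml version="1.0" encoding="utf-8"?>\n'
    u'<feed><title>Wiadomo\u015bci</title></feed>'
)

# Option 1: drop the declaration, then parse the decoded text.
stripped = strip_xml_declaration(SAMPLE)
root_from_text = etree.fromstring(stripped)

# Option 2: encode back to bytes. The target encoding must match the one
# named in the declaration (utf-8 in this sample).
root_from_bytes = etree.fromstring(SAMPLE.encode("utf-8"))
```

Option 2 is usually the simpler one when you still have the original bytes, since then no decoding or re-encoding is needed at all.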

answered Sep 22 '22 12:09 by Julian Todd