Encoding error while parsing RSS with lxml

I want to parse a downloaded RSS feed with lxml, but I don't know how to handle the UnicodeDecodeError it raises.

import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)
tree = etree.parse(response, parser)

But I get an error:

tree   = etree.parse(response, parser)
File "lxml.etree.pyx", line 2692, in lxml.etree.parse (src/lxml/lxml.etree.c:49594)
  File "parser.pxi", line 1500, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:71364)
  File "parser.pxi", line 1529, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:71647)
  File "parser.pxi", line 1429, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:70742)
  File "parser.pxi", line 975, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:67740)
  File "parser.pxi", line 539, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:63824)
  File "parser.pxi", line 625, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:64745)
  File "parser.pxi", line 559, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:64027)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 97: ordinal not in range(128)

asked Apr 27 '11 23:04

domi

2 Answers

I ran into a similar problem, and it turns out it has nothing to do with encodings. lxml is reporting a completely unrelated error: the .parse function expects a filename, URL, or file-like object, not a string holding the document contents. When lxml tries to format that error message, though, it chokes on the non-ASCII characters and raises the confusing UnicodeDecodeError instead. This is unfortunate, and other people have commented on the issue here:

https://mailman-mail5.webfaction.com/pipermail/lxml/2009-February/004393.html

Luckily, the fix is very easy. Just replace .parse with .fromstring and you should be good to go:

import urllib2
import chardet
from lxml import etree

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request)
response = response.read()
encd = chardet.detect(response)['encoding']
parser = etree.XMLParser(ns_clean=True, recover=True, encoding=encd)

## lxml Y U NO MAKE SENSE!!!
# fromstring() takes the document contents; parse() expects a filename, URL, or file object
tree = etree.fromstring(response, parser)

Just tested this on my machine and it worked fine. Hope it helps!
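
For anyone reading this on Python 3, here is a minimal sketch of the same distinction. It is not the original poster's code: the feed bytes are a made-up stand-in for response.read(), and the stdlib fallback import is only an assumption so the sketch still runs where lxml is not installed.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree

# Made-up stand-in for the bytes returned by response.read(); not the real feed.
data = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<rss><channel><title>Wiadomości</title></channel></rss>'
).encode("utf-8")

# fromstring() takes the document contents and returns the root element.
root = etree.fromstring(data)

# parse() takes a filename or file-like object (lxml also accepts a URL) and
# returns an ElementTree, so in-memory bytes must be wrapped in BytesIO first.
tree = etree.parse(io.BytesIO(data))

title_from_string = root.findtext("channel/title")
title_from_file = tree.getroot().findtext("channel/title")
print(title_from_string == title_from_file)
```

Either call works as long as it is given the kind of input it expects. Passing the raw bytes straight to parse() is what triggers the misleading error in the question.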

answered Sep 25 '22 12:09

Luiz Scheidegger

It's often easier to load the string and sort out its encoding yourself first, and then call fromstring on it, rather than rely on the lxml.etree.parse() function and its difficult-to-manage encoding options.

This particular RSS file begins with an encoding declaration, so everything should just work:

<?xml version="1.0" encoding="utf-8"?>

The following code shows some of the variations you can use to make etree parse input in different encodings. You can also ask it to write output in a different encoding, which will then appear in the XML declaration of the result.

import lxml.etree
import urllib2

request = urllib2.Request('http://wiadomosci.onet.pl/kraj/rss.xml')
response = urllib2.urlopen(request).read()
print [response]
# ['<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\xc5\x9bci...']

# decode the raw bytes into a unicode string
uresponse = response.decode("utf8")
print [uresponse]
# [u'<?xml version="1.0" encoding="utf-8"?>\n<feed xmlns=... <title>Wiadomo\u015bci...']

# parsing the raw bytes works: lxml reads the encoding from the declaration
tree = lxml.etree.fromstring(response)
res = lxml.etree.tostring(tree)
print [res]
# ['<feed xmlns="http://www.w3.org/2005/Atom">\n<title>Wiadomo&#347;ci...']

# asking for a specific output encoding adds a matching declaration
lres = lxml.etree.tostring(tree, encoding="latin1")
print [lres]
# ["<?xml version='1.0' encoding='latin1'?>\n<feed xmlns=...<title>Wiadomo&#347;ci..."]

# works because the 38 character encoding declaration is sliced off
print lxml.etree.fromstring(uresponse[38:])

# throws ValueError(u'Unicode strings with encoding declaration are not supported.',)
print lxml.etree.fromstring(uresponse)

Code can be tried here: http://scraperwiki.com/scrapers/lxml_and_encoding_declarations/edit/#
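
The [38:] slice above hard-codes the length of this one feed's declaration. As a rough Python 3 sketch (not part of the original answer), the declaration could instead be stripped with a regular expression before handing an already-decoded string to fromstring. The helper name strip_xml_declaration and the sample feed text are made up for illustration, and the stdlib fallback import is only an assumption so the sketch runs without lxml.

```python
import re

try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the standard library so the sketch still runs.
    import xml.etree.ElementTree as etree

# Matches an XML declaration such as <?xml version="1.0" encoding="utf-8"?>
# at the start of the text, plus any whitespace around it.
XML_DECLARATION = re.compile(r"^\s*<\?xml\s[^>]*\?>\s*")


def strip_xml_declaration(text):
    """Hypothetical helper: drop a leading XML declaration from decoded text."""
    return XML_DECLARATION.sub("", text, count=1)


# Made-up stand-in for the decoded feed (uresponse in the code above).
uresponse = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    '<feed xmlns="http://www.w3.org/2005/Atom"><title>Wiadomości</title></feed>'
)

body = strip_xml_declaration(uresponse)
root = etree.fromstring(body)
title = root.findtext("{http://www.w3.org/2005/Atom}title")
print(title)
```

If you still have the raw bytes, passing those to fromstring is simpler, since the parser then reads the encoding from the declaration itself. Stripping the declaration only makes sense when the text has already been decoded.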

answered Sep 22 '22 12:09

Julian Todd


Tags: python, rss, lxml, chardet, scraperwiki
