I have an XML file; please download it and save it as blog.xml.
It is an export of my posts from Google Blogger. I wrote some code to parse it, but something goes wrong with lxml.
code1:
from stripogram import html2text
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string = entry.content[0]['value'].encode("utf-8")
    print html2text(string)
code1 gives the correct result.
code2:
import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string = entry.content[0]['value']
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
code2 fails with this error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82659)
ValueError: Unicode strings with encoding declaration are not supported.
code3:
import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string = entry.content[0]['value'].encode("utf-8")
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
code3 fails with a different error:
Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827)
lxml.etree.XMLSyntaxError: line 1395: Tag b:include invalid
How should I handle the encoding so that lxml parses the HTML string properly?
There is a bug in lxml. Check the output of this code:
import lxml.html
import feedparser
def test():
    try:
        lxml.html.document_fromstring('')
    except Exception as e:
        print e
d = feedparser.parse('blog.xml')
e = d.entries[0].content[0]['value'].encode('utf-8')
test() # XMLSyntaxError: None
lxml.html.document_fromstring(e)
test() # XMLSyntaxError: line 1407: Tag b:include invalid
So the error message is misleading: the real reason your parsing fails is that you pass empty strings to document_fromstring. The "Tag b:include invalid" text is left over from an earlier parse.
Try this code:
import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string = entry.content[0]['value'].encode("utf-8")
    if not string:
        continue
    myhtml = lxml.html.document_fromstring(string)
    print myhtml.text_content()
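The ValueError from code2 is a separate issue: lxml refuses a unicode string that begins with an XML declaration naming an encoding. Encoding to bytes first, as code3 does, avoids it. Another option, sketched here on the assumption that the declaration sits at the very start of the string, is to strip it before parsing. The helper name strip_xml_declaration is made up for this example and is not part of lxml or feedparser.

```python
import re

# Matches a leading XML declaration such as
# <?xml version="1.0" encoding="UTF-8"?>
# (assumption: it appears only at the start of the string)
_XML_DECLARATION = re.compile(r'^\s*<\?xml[^>]*\?>')


def strip_xml_declaration(text):
    """Return text without a leading <?xml ... ?> declaration, if any."""
    return _XML_DECLARATION.sub('', text, count=1)
```

The result should then be acceptable to lxml.html.document_fromstring as a unicode string, provided it is not empty.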
You could create a parser yourself instead of using document_fromstring:
from cStringIO import StringIO
import feedparser
from lxml import etree
d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    text = entry.content[0]['value'].encode('utf8')
    # HTMLParser recovers from markup that a strict XML parser rejects
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(text), parser)
    print ''.join(tree.xpath('.//text()'))
For Blogger.com Atom feed exports, this prints the text content of each entry's .content[0].value.