Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to handle the encode in lxml to parse html-string properly?

Tags:

python

lxml

I have a xml file. please download it and save it as blog.xml. It is the list of my files in Google-blogger, i write some codes to parse it ,there is a something wring with lxml .

code1:

from stripogram import html2text
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8")
    print   html2text(string)

It get a right result with code1.

code2:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'] 
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()

It get a wrong output with code2.

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82659)
 ValueError: Unicode strings with encoding declaration are not supported.

code3:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8") 
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()

It get a wrong output with code3.

 Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827)
 lxml.etree.XMLSyntaxError: line 1395: Tag b:include invalid

How to handle the encode in lxml to parse html-string properly?

like image 628
showkey Avatar asked Apr 07 '13 11:04

showkey


2 Answers

There is a bug in lxml. Check output of this code:

import lxml.html
import feedparser

def test():
    try:
        lxml.html.document_fromstring('')
    except Exception as e:
        print e

d = feedparser.parse('blog.xml')
e = d.entries[0].content[0]['value'].encode('utf-8')

test() # XMLSyntaxError: None

lxml.html.document_fromstring(e)
test() # XMLSyntaxError: line 1407: Tag b:include invalid

So the error is confusing, the real reason why your parsing fails is that you pass empty strings to document_fromstring.

Try this code:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8") 
    if not string:
        continue
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()
like image 131
gatto Avatar answered Sep 20 '22 15:09

gatto


You could create yourself a parser, instead of using document_fromstring:

from cStringIO import StringIO
from lxml import etree

for num, entry in enumerate(d.entries):
    text = entry.content[0]['value'].encode('utf8')
    parser = etree.HTMLParser()
    tree   = etree.parse(StringIO(text), parser)
    print  ''.join(tree.xpath('.//text()'))

For Blogger.com Atom feed exports, this works to print the text content of the .content[0].value entry.

like image 22
Martijn Pieters Avatar answered Sep 21 '22 15:09

Martijn Pieters