How to handle the encode in lxml to parse html-string properly?

Tags: python, lxml

I have an XML file; please download it and save it as blog.xml. It is the list of my files in Google Blogger. I wrote some code to parse it, but something goes wrong with lxml.

code1:

from stripogram import html2text
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8")
    print   html2text(string)

code1 gives the correct result.

code2:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'] 
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()

code2 fails with the following error:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1569, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82659)
ValueError: Unicode strings with encoding declaration are not supported.

code3:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8") 
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()

code3 also fails, with a different error:

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 532, in document_fromstring
    value = etree.fromstring(html, parser, **kw)
  File "lxml.etree.pyx", line 2754, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54631)
  File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82748)
  File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81546)
  File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78216)
  File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74472)
  File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75363)
  File "parser.pxi", line 599, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74827)
lxml.etree.XMLSyntaxError: line 1395: Tag b:include invalid

How do I handle the encoding in lxml so that the HTML string is parsed properly?

asked Apr 07 '13 11:04

showkey

2 Answers

There is a bug in lxml. Check the output of this code:

import lxml.html
import feedparser

def test():
    try:
        lxml.html.document_fromstring('')
    except Exception as e:
        print e

d = feedparser.parse('blog.xml')
e = d.entries[0].content[0]['value'].encode('utf-8')

test() # XMLSyntaxError: None

lxml.html.document_fromstring(e)
test() # XMLSyntaxError: line 1407: Tag b:include invalid

So the error message is misleading: the real reason your parsing fails is that you pass empty strings to document_fromstring.

Try this code:

import lxml.html
import feedparser
d = feedparser.parse('blog.xml')
for num,entry in enumerate(d.entries):
    string=entry.content[0]['value'].encode("utf-8") 
    if not string:
        continue
    myhtml=lxml.html.document_fromstring(string)
    print  myhtml.text_content()
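
The same guard can be sketched for Python 3 with the empty check factored out. This is only an illustration: iter_nonempty_html and extract_texts are hypothetical names, and skipping whitespace-only values as well as empty ones is an added assumption. Encoding to bytes also sidesteps the "Unicode strings with encoding declaration" ValueError from code2.

```python
def iter_nonempty_html(values, encoding="utf-8"):
    """Yield each non-blank markup string as bytes, skipping empty ones."""
    for value in values:
        # Assumption: None, '' and whitespace-only values carry no content.
        if not value or not value.strip():
            continue
        yield value.encode(encoding)


def extract_texts(values):
    """Parse every non-empty value with lxml and return its text content."""
    # Imported here so iter_nonempty_html stays usable without lxml installed.
    import lxml.html

    return [
        lxml.html.document_fromstring(raw).text_content()
        for raw in iter_nonempty_html(values)
    ]
```

With feedparser, the values would be entry.content[0]['value'] for each entry in d.entries.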

answered Sep 20 '22 15:09

gatto

You could create a parser yourself instead of using document_fromstring:

from cStringIO import StringIO
import feedparser
from lxml import etree

d = feedparser.parse('blog.xml')
for num, entry in enumerate(d.entries):
    # Encode to a byte string so lxml accepts the encoding declaration
    text = entry.content[0]['value'].encode('utf8')
    parser = etree.HTMLParser()
    tree = etree.parse(StringIO(text), parser)
    print ''.join(tree.xpath('.//text()'))

For Blogger.com Atom feed exports, this works to print the text content of the .content[0].value entry.
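
On Python 3, where cStringIO no longer exists, the same idea might look like the sketch below. It assumes lxml is installed; html_to_text is a hypothetical helper name, and io.BytesIO stands in for cStringIO.StringIO because the input is a byte string.

```python
from io import BytesIO

from lxml import etree


def html_to_text(raw):
    """Return the concatenated text nodes of an HTML byte string."""
    parser = etree.HTMLParser()  # lenient HTML parser rather than the XML one
    tree = etree.parse(BytesIO(raw), parser)
    return "".join(tree.xpath(".//text()"))


print(html_to_text(b"<p>Hello <b>world</b></p>"))
```

Each feed entry would be passed in as entry.content[0]['value'].encode('utf8'), as in the loop above.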

answered Sep 21 '22 15:09

Martijn Pieters
