
Equivalent to InnerHTML when using lxml.html to parse HTML

I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.

I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag. For example, given:

    <body>
    <h1>A title</h1>
    <p>Some text</p>
    </body>

The innerHTML of the body is therefore:

    <h1>A title</h1>
    <p>Some text</p>

I can do it using hacks (converting to a string, regexes, etc.), but I'm assuming there is a correct way to do this with the library that I am missing due to unfamiliarity. Thanks for any help.

EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:

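    # Python 2 (cStringIO and the print statement)
    from lxml import html
    from cStringIO import StringIO

    t = html.parse(StringIO("""<body>
    <h1>A title</h1>
    <p>Some text</p>
    Untagged text
    <p>
    Unclosed p tag
    </body>"""))
    root = t.getroot()
    body = root.body
    # Leading text of <body>, then each direct child serialized with its tail text.
    # iterchildren() is used because iterdescendants() would repeat nested elements.
    print (body.text or '') + ''.join([html.tostring(child) for child in body.iterchildren()])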

Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
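
To see the fix-up in practice, here is a minimal Python 3 sketch (the code above is Python 2). The sample markup and the variable names are made up for illustration; only html.fromstring and html.tostring come from lxml.

```python
from lxml import html

# Hypothetical sample: the <p> tag is never closed in the input.
source = "<html><body><h1>A title</h1><p>Unclosed p tag</body></html>"

doc = html.fromstring(source)

# encoding="unicode" makes tostring return str instead of bytes on Python 3.
serialized = html.tostring(doc.body, encoding="unicode")

# The serialized body now contains a closing </p> that the input never had,
# so the output is not a byte-for-byte copy of the original markup.
print(serialized)
```

If the exact original markup matters, the parsed tree is the wrong place to get it from; keep the source string around instead.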

somewhatoff asked May 25 '11 10:05



2 Answers

Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:

<body>This text is ignored <h1>Title</h1><p>Some text</p></body> 

Text directly under the root element is ignored. I ended up doing this:

    (body.text or '') +\
        ''.join([html.tostring(child) for child in body.iterchildren()])
lormus answered Oct 15 '22 09:10


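You can get the children of an ElementTree node using the getchildren() or iterdescendants() methods of the root node. Note that getchildren() is deprecated in lxml in favour of iterchildren() or list(node), and that iterdescendants() also yields nested elements, so it matches iterchildren() only for a flat tree like this one: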

    >>> from lxml import etree
    >>> from cStringIO import StringIO
    >>> t = etree.parse(StringIO("""<body>
    ... <h1>A title</h1>
    ... <p>Some text</p>
    ... </body>"""))
    >>> root = t.getroot()
    >>> for child in root.iterdescendants():
    ...     print etree.tostring(child)
    ...
    <h1>A title</h1>

    <p>Some text</p>

This can be shortened to:

print ''.join([etree.tostring(child) for child in root.iterdescendants()]) 
pobk answered Oct 15 '22 08:10