I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.
I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag. For example, given:
<body> <h1>A title</h1> <p>Some text</p> </body>
innerHTML is therefore:
<h1>A title</h1> <p>Some text</p>
I can do it using hacks (converting to string/regexes etc) but I'm assuming that there is a correct way to do this using the library which I am missing due to unfamiliarity. Thanks for any help.
EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:
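from io import StringIO
from lxml import html
t = html.parse(StringIO(
"""<body>
<h1>A title</h1>
<p>Some text</p>
Untagged text
<p>
Unclosed p tag
</body>"""))
root = t.getroot()
body = root.body
# body.text is the text before the first child; tostring() also emits each child's tail text.
# iterchildren() visits direct children only, so nested elements are not serialized twice.
print((body.text or '') + ''.join(
    [html.tostring(child, encoding='unicode') for child in body.iterchildren()]))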
Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
lxml is a similar parser, but it is driven more by XML features than by HTML. It depends on external C libraries (libxml2 and libxslt) and is faster than html5lib.
lxml is a Python library which allows for easy handling of XML and HTML files, and can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but for better results, developers sometimes prefer to write their own XML and HTML parsers.
Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:
<body>This text is ignored <h1>Title</h1><p>Some text</p></body>
Text directly under the root element is ignored. I ended up doing this:
(body.text or '') + \
    ''.join([html.tostring(child, encoding='unicode') for child in body.iterchildren()])
You can get the elements under an ElementTree node using the getchildren() or iterdescendants() methods of the root node (getchildren() returns only the direct children, while iterdescendants() walks every nested element):
>>> from io import StringIO
>>> from lxml import etree
>>> t = etree.parse(StringIO("""<body>
... <h1>A title</h1>
... <p>Some text</p>
... </body>"""))
>>> root = t.getroot()
>>> for child in root.iterdescendants():
...     print(etree.tostring(child, encoding='unicode'))
...
<h1>A title</h1>

<p>Some text</p>

This can be shortened as follows:
print(''.join([etree.tostring(child, encoding='unicode') for child in root.iterdescendants()]))