Equivalent to InnerHTML when using lxml.html to parse HTML

Tags: python, parsing, lxml

I'm working on a script using lxml.html to parse web pages. I have done a fair bit of BeautifulSoup in my time but am now experimenting with lxml due to its speed.

I would like to know the most sensible way in the library to do the equivalent of JavaScript's innerHTML, that is, to retrieve or set the complete contents of a tag. For example, given:

    <body>
    <h1>A title</h1>
    <p>Some text</p>
    </body>

the innerHTML of the body is therefore:

    <h1>A title</h1>
    <p>Some text</p>

I can do it with hacks (converting to a string, regexes, etc.), but I assume there is a proper way to do this with the library that I am missing through unfamiliarity. Thanks for any help.

EDIT: Thanks to pobk for showing me the way on this so quickly and effectively. For anyone trying the same, here is what I ended up with:

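    from lxml import html
    from cStringIO import StringIO

    t = html.parse(StringIO(
    """<body>
    <h1>A title</h1>
    <p>Some text</p>
    Untagged text
    <p>
    Unclosed p tag
    </body>"""))
    root = t.getroot()
    body = root.body
    # body.text is the text before the first child; tostring() includes each child's tail text.
    # iterchildren() is used so that nested elements are not serialized twice.
    print (body.text or '') + ''.join([html.tostring(child) for child in body.iterchildren()])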

Note that the lxml.html parser will fix up the unclosed tag, so beware if this is a problem.


asked May 25 '11 10:05

somewhatoff

2 Answers

Sorry for bringing this up again, but I've been looking for a solution and yours contains a bug:

<body>This text is ignored <h1>Title</h1><p>Some text</p></body>

Text directly under the root element is ignored. I ended up doing this:

    (body.text or '') +\
        ''.join([html.tostring(child) for child in body.iterchildren()])


answered Oct 15 '22 09:10

lormus

You can get the children of an ElementTree node with the getchildren() method (deprecated in lxml in favour of iterating over the node or calling list(node)), or walk all of its descendants with iterdescendants():

    >>> from lxml import etree
    >>> from cStringIO import StringIO
    >>> t = etree.parse(StringIO("""<body>
    ... <h1>A title</h1>
    ... <p>Some text</p>
    ... </body>"""))
    >>> root = t.getroot()
    >>> for child in root.iterdescendants():
    ...     print etree.tostring(child)
    ...
    <h1>A title</h1>

    <p>Some text</p>

This can be shortened to:

print ''.join([etree.tostring(child) for child in root.iterdescendants()])

answered Oct 15 '22 08:10

pobk
