I was wondering what the easiest way is to wrap an element with another element using lxml and Python. For example, if I have an HTML snippet:
<h1>The cool title</h1>
<p>Something Neat</p>
<table>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</table>
<p>The end of the snippet</p>
And I want to wrap the table element with a section element like this:
<h1>The cool title</h1>
<p>Something Neat</p>
<section>
<table>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</table>
</section>
<p>The end of the snippet</p>
Another thing I would like to do is scour the XML document for h1s with a certain attribute and then wrap all of the elements until the next h1 tag in an element. For example:
<h1 class='neat'>Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
<h1 class='neat'>Subject 2</h1>
<p>And Even More</p>
Converted to:
<section>
<h1 class='neat'>Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
</section>
<section>
<h1 class='neat'>Subject 2</h1>
<p>And Even More</p>
</section>
Thanks for all the help, Chris
lxml is a Python library for handling XML and HTML files, and it can also be used for web scraping. There are a lot of off-the-shelf XML parsers out there, but developers who want more control over how a document is parsed and modified often turn to lxml.
lxml provides a very simple and powerful API for parsing XML and HTML. It supports one-step parsing as well as step-by-step parsing using an event-driven API (currently only for XML).
lxml has been downloaded from the Python Package Index millions of times and is also available directly in many package distributions, e.g. for Linux or macOS.
lxml's awesome for parsing well-formed XML, but it's not so good if you've got non-XHTML HTML. If that's the case, then go for BeautifulSoup, as suggested by systemizer.
With lxml, this is a fairly easy way to insert a section around all tables in the document:
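import lxml.etree as ET

TEST = "<html><h1>...</html>"  # placeholder: substitute the full document here

def insert_section(root):
    # findall returns a list, so moving tables inside the loop is safe
    tables = root.findall(".//table")
    for table in tables:
        section = ET.Element("section")
        table.addprevious(section)  # place the empty section just before the table
        section.insert(0, table)  # this moves the table into the section

root = ET.fromstring(TEST)
insert_section(root)
print(ET.tostring(root, encoding="unicode"))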
You could do something similar to wrap the headings, but you would need to iterate through all the elements you want to wrap and move them to the section. element.index(child) and list slices might help here.
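As a rough sketch of that idea, something like the following might work. It walks siblings with getnext() instead of index() and slices, and it assumes the headings are direct siblings of the elements to wrap. The function name wrap_sections, its css_class parameter, and the <body> root around the test input are made up for illustration; they are not part of lxml or the question.

```python
import lxml.etree as ET

# Hypothetical test input: the question's second snippet, placed inside a
# <body> root so that it parses as a single well-formed tree.
TEST = """<body>
<h1 class='neat'>Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
<h1 class='neat'>Subject 2</h1>
<p>And Even More</p>
</body>"""


def wrap_sections(root, css_class="neat"):
    """Wrap each matching h1, plus everything up to the next h1, in a <section>."""
    # xpath returns a list, so the tree can be modified inside the loop.
    headings = root.xpath(".//h1[@class=$cls]", cls=css_class)
    for h1 in headings:
        # Gather the heading and its following siblings, stopping at the next h1.
        group = [h1]
        sibling = h1.getnext()
        while sibling is not None and sibling.tag != "h1":
            group.append(sibling)
            sibling = sibling.getnext()
        section = ET.Element("section")
        h1.addprevious(section)  # place the empty section where the heading is
        for element in group:
            section.append(element)  # append moves the element into the section
    return root


root = wrap_sections(ET.fromstring(TEST))
print(ET.tostring(root, encoding="unicode"))
```

Note that this stops at any following h1, not only ones with the matching class, and that an h1 without the class is left unwrapped.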