
Python lxml wrapping elements

I was wondering what the easiest way is to wrap an element with another element using lxml and Python. For example, if I have an HTML snippet:

<h1>The cool title</h1>
<p>Something Neat</p>
<table>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</table>
<p>The end of the snippet</p>

And I want to wrap the table element with a section element like this:

<h1>The cool title</h1>
<p>Something Neat</p>
<section>
<table>
<tr>
<td>aaa</td>
<td>bbb</td>
</tr>
</table>
</section>
<p>The end of the snippet</p>

Another thing I would like to do is scour the XML document for h1s with a certain attribute, and then wrap all of the elements up to the next h1 tag in an element. For example:

<h1 class='neat'>Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
<h1 class='neat'>Subject 2</h1>
<p>And Even More</p>

Converted to:

<section>
<h1 class='neat'>Subject 1</h1>
<p>Here is a bunch of boring text</p>
<h2>Minor Heading</h2>
<p>Here is some more</p>
</section>
<section>
<h1 class='neat'>Subject 2</h1>
<p>And Even More</p>
</section>

Thanks for all the help, Chris

Chris asked May 18 '11 00:05



1 Answer

lxml is great for parsing well-formed XML, but its XML parser is not so good if you've got HTML that isn't valid XHTML. If that's the case, go for BeautifulSoup, as suggested by systemizer.

With lxml, this is a fairly easy way to insert a section around all tables in the document:

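import lxml.etree as ET

# Placeholder document; substitute your own well-formed markup here.
TEST = "<html><h1>...</html>"

def insert_section(root):
    # findall() returns a list, so the tree can be modified while looping
    tables = root.findall(".//table")
    for table in tables:
        section = ET.Element("section")
        table.addprevious(section)   # put the empty section where the table is
        section.insert(0, table)     # this moves the table (and its tail text)
        section.tail = table.tail    # keep any trailing text outside the section
        table.tail = None

root = ET.fromstring(TEST)
insert_section(root)
print(ET.tostring(root, encoding="unicode"))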

You could do something similar to wrap the headings, but you would need to iterate over all the elements you want to wrap and move each of them into the new section. element.index(child) and list slices might help here.
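
As a rough sketch of that idea (not a definitive implementation), the function below uses index() and a slice to collect each matching h1 together with the siblings that follow it, then moves the group into a new section. The function name wrap_h1_sections, the cls parameter, and the body root element are hypothetical and only there for illustration. It assumes the headings are direct children of one parent, and that an h1 without the class ends the previous group without being wrapped itself.

```python
import lxml.etree as ET

# The <body> root is an assumption: the snippet in the question has no
# single root element, which an XML parser requires.
DOC = (
    "<body>"
    "<h1 class='neat'>Subject 1</h1>"
    "<p>Here is a bunch of boring text</p>"
    "<h2>Minor Heading</h2>"
    "<p>Here is some more</p>"
    "<h1 class='neat'>Subject 2</h1>"
    "<p>And Even More</p>"
    "</body>"
)

def wrap_h1_sections(parent, cls="neat"):
    """Wrap each <h1 class=cls> child of parent, plus the siblings that
    follow it up to the next <h1>, in a new <section> element."""
    headings = parent.findall("h1")  # direct children only, in document order
    for i, heading in enumerate(headings):
        if heading.get("class") != cls:
            continue  # assumption: non-matching h1s are left unwrapped
        # Positions are recomputed on every pass because earlier moves
        # change the indices of the remaining children.
        start = parent.index(heading)
        if i + 1 < len(headings):
            end = parent.index(headings[i + 1])
        else:
            end = len(parent)
        group = parent[start:end]      # a plain list of the elements to wrap
        section = ET.Element("section")
        heading.addprevious(section)   # place the section where the group starts
        section.extend(group)          # appending moves each element into it

root = ET.fromstring(DOC)
wrap_h1_sections(root)
print(ET.tostring(root, pretty_print=True, encoding="unicode"))
```

Note that lxml carries an element's tail text along when the element is moved, so any text that sits between the last wrapped element and the next h1 ends up inside the section as well.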

3 revs answered Sep 22 '22 17:09