I have a simple html file like this. In fact I pulled it from a wiki page, removed some html attributes and converted to this simple html page.
<html>
<body>
<h1>draw electronics schematics</h1>
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
<h2>second header</h2>
<p>
<!-- ..again some text and images -->
</p>
</body>
</html>
I read this html file using python and beautiful soup like this.
from bs4 import BeautifulSoup
soup = BeautifulSoup(open("test.html"))
pages = []
What I'd like to do is split this html page into two parts. The first part should run from the first <h2> header up to the second <h2> header, and the second part from the second <h2> header to the </body> tag. Then I'd like to store them in a list, e.g. pages, so that I can create multiple pages from one html page according to its <h2> tags.
Any ideas on how I should do this? Thanks.
Look for the h2 tags, then use .next_sibling to grab everything that follows until you reach another h2 tag:
from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed
soup = BeautifulSoup(open("test.html"), "html.parser")
pages = []
h2tags = soup.find_all('h2')

def next_element(elem):
    while elem is not None:
        # Find next element, skip NavigableString objects
        elem = elem.next_sibling
        if hasattr(elem, 'name'):
            return elem

for h2tag in h2tags:
    page = [str(h2tag)]
    elem = next_element(h2tag)
    # Collect sibling tags until the next h2 (or the end of the parent)
    while elem and elem.name != 'h2':
        page.append(str(elem))
        elem = next_element(elem)
    pages.append('\n'.join(page))
Using your sample, this gives:
>>> pages
['<h2>first header</h2>\n<p>\n<!-- ..some text images -->\n</p>\n<h3>some header</h3>\n<p>\n<!-- ..some image -->\n</p>\n<p>\n<!-- ..some text -->\n</p>', '<h2>second header</h2>\n<p>\n<!-- ..again some text and images -->\n</p>']
>>> print(pages[0])
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
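The same idea can be wrapped in a reusable function. This is only a sketch under a few assumptions: split_on_h2 and SAMPLE_HTML are hypothetical names introduced here, the markup is passed in as a string instead of being read from test.html, and the .next_siblings generator with an isinstance check stands in for the next_element helper above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened stand-in for the question's test.html (assumed content)
SAMPLE_HTML = """<html>
<body>
<h1>draw electronics schematics</h1>
<h2>first header</h2>
<p>first text</p>
<h3>some header</h3>
<p>more text</p>
<h2>second header</h2>
<p>second text</p>
</body>
</html>"""


def split_on_h2(html):
    """Return one HTML string per <h2> section (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    sections = []
    for h2 in soup.find_all("h2"):
        parts = [str(h2)]
        # next_siblings yields every following sibling, text nodes included
        for sibling in h2.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace and other NavigableString objects
            if sibling.name == "h2":
                break  # the next section starts here
            parts.append(str(sibling))
        sections.append("\n".join(parts))
    return sections


pages = split_on_h2(SAMPLE_HTML)
```

Like the original loop, this only keeps sibling tags, so bare text sitting directly between elements (outside any tag) is dropped, and anything before the first h2, such as the h1, is not part of any page.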