How to split an HTML page into multiple pages using Python and Beautiful Soup

Tags: python, html, beautifulsoup

I have a simple HTML file like the one below. I pulled it from a wiki page, removed some HTML attributes, and reduced it to this simple page.

<html>
   <body>
      <h1>draw electronics schematics</h1>
      <h2>first header</h2>
      <p>
         <!-- ..some text images -->
      </p>
      <h3>some header</h3>
      <p>
         <!-- ..some image -->
      </p>
      <p>
         <!-- ..some text -->
      </p>
      <h2>second header</h2>
      <p>
         <!-- ..again some text and images -->
      </p>
   </body>
</html>

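I read this HTML file using Python and Beautiful Soup like this:

from bs4 import BeautifulSoup

# Name the parser explicitly and close the file once it has been parsed.
with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")

pages = []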

What I'd like to do is split this HTML page into two parts. The first part runs from the first <h2> header up to the second <h2> header, and the second part runs from the second <h2> header to the closing </body> tag. I'd then like to store the parts in a list, e.g. pages, so that I can create multiple pages from one HTML page according to its <h2> tags.

Any ideas on how I should do this? Thanks.


asked Jan 21 '13 18:01

Erdem

1 Answer

Find the h2 tags, then follow .next_sibling from each one, collecting elements until you reach the next h2 tag:

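from bs4 import BeautifulSoup, Tag

with open("test.html") as f:
    soup = BeautifulSoup(f, "html.parser")

pages = []
h2tags = soup.find_all('h2')

def next_element(elem):
    # Return the next sibling that is a Tag, skipping NavigableString
    # objects (whitespace, bare text); returns None when no siblings remain.
    while elem is not None:
        elem = elem.next_sibling
        if isinstance(elem, Tag):
            return elem

for h2tag in h2tags:
    # Each page starts at an h2 and runs up to, but not including, the next h2.
    page = [str(h2tag)]
    elem = next_element(h2tag)
    while elem is not None and elem.name != 'h2':
        page.append(str(elem))
        elem = next_element(elem)
    pages.append('\n'.join(page))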

Using your sample, this gives:

>>> pages
['<h2>first header</h2>\n<p>\n<!-- ..some text images -->\n</p>\n<h3>some header</h3>\n<p>\n<!-- ..some image -->\n</p>\n<p>\n<!-- ..some text -->\n</p>', '<h2>second header</h2>\n<p>\n<!-- ..again some text and images -->\n</p>']
>>> print(pages[0])
<h2>first header</h2>
<p>
<!-- ..some text images -->
</p>
<h3>some header</h3>
<p>
<!-- ..some image -->
</p>
<p>
<!-- ..some text -->
</p>
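
If you want to reuse this, the same idea can be wrapped in a function. The following is only a sketch: split_by_heading is a hypothetical helper name (it is not part of Beautiful Soup), the sample markup is a shortened stand-in for your file, and wrapping each fragment in a bare <html><body> shell is an assumption about what a finished "page" should look like. It iterates over .next_siblings instead of calling .next_sibling repeatedly.

```python
from bs4 import BeautifulSoup, Tag


def split_by_heading(html, heading="h2"):
    """Return one HTML fragment per `heading` tag (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    pages = []
    for start in soup.find_all(heading):
        parts = [str(start)]
        # .next_siblings yields every following sibling, strings included.
        for sibling in start.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip whitespace and other bare strings
            if sibling.name == heading:
                break  # the next page starts here
            parts.append(str(sibling))
        pages.append("\n".join(parts))
    return pages


# Shortened stand-in for the file in the question.
sample = """<html><body>
<h1>draw electronics schematics</h1>
<h2>first header</h2>
<p>text one</p>
<h3>some header</h3>
<p>text two</p>
<h2>second header</h2>
<p>text three</p>
</body></html>"""

pages = split_by_heading(sample)

# Wrap each fragment in a minimal document so it can stand alone as a page.
documents = [
    "<html>\n<body>\n" + fragment + "\n</body>\n</html>"
    for fragment in pages
]
```

Each entry in documents can then be written to its own file. Note that, like the loop above, this drops any bare text that sits directly between tags, since only Tag siblings are kept.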


answered Sep 22 '22 16:09

Martijn Pieters
