Access next sibling <li> element with BeautifulSoup

I am completely new to web parsing with Python/BeautifulSoup. I have an HTML page that contains (in part) the following markup:

<div id="pages">
    <ul>
        <li class="active"><a href="example.com">Example</a></li>
        <li><a href="example.com">Example</a></li>
        <li><a href="example1.com">Example 1</a></li>
        <li><a href="example2.com">Example 2</a></li>
    </ul>
</div>

I have to visit each link (basically each <li> element) until there are no more <li> tags left. Each time a link is visited, its corresponding <li> element gets the class 'active'. My code is:

from bs4 import BeautifulSoup
import urllib2

landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage, "html.parser")

pageList = soup.find("div", {"id": "pages"})

page = pageList.find("li", {"class": "active"})

This code gives me the first <li> item in the list. My plan is to keep checking whether the next sibling is None; if it is not, I make an HTTP request to the href attribute of the <a> tag inside that sibling <li>. That takes me to the next page, and so on, until there are no more pages.

But I can't figure out how to get the next_sibling of the page variable given above. Is it page.next_sibling.get("href") or something like that? I looked through the documentation, but somehow couldn't find it. Can someone help please?

asked Feb 01 '16 by user3033194

2 Answers

Use find_next_sibling() and be explicit about which sibling element you want to find:

next_li_element = page.find_next_sibling("li")

next_li_element will be None once page is the last <li> in the list:

if next_li_element is None:
    # no more pages to go
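Putting the answer together, a minimal sketch of the full pagination loop might look like this. It uses a hard-coded HTML string mirroring the structure from the question instead of a live urllib2 request, so the network fetch on each iteration is elided:

```python
from bs4 import BeautifulSoup

# Sample markup mirroring the structure from the question
html = """
<div id="pages">
    <ul>
        <li class="active"><a href="example.com">Example</a></li>
        <li><a href="example1.com">Example 1</a></li>
        <li><a href="example2.com">Example 2</a></li>
    </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
page = soup.find("div", {"id": "pages"}).find("li", {"class": "active"})

hrefs = []
while True:
    next_li = page.find_next_sibling("li")
    if next_li is None:
        break  # no more pages to go
    hrefs.append(next_li.find("a")["href"])
    page = next_li

print(hrefs)  # ['example1.com', 'example2.com']
```

In the real script you would fetch each collected href and re-parse the response on every iteration, since the 'active' class moves to the newly visited <li>.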
answered Oct 11 '22 by alecxe

Have you looked at dir(page) or the documentation? If so, how did you miss .find_next_sibling()?

from bs4 import BeautifulSoup
import urllib2

landingPage = urllib2.urlopen('somepage.com').read()
soup = BeautifulSoup(landingPage, "html.parser")

pageList = soup.find("div", {"id": "pages"})

page = pageList.find("li", {"class": "active"})
sibling = page.find_next_sibling()
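Note that this is not the same as the .next_sibling attribute the question asks about: .next_sibling returns the literal next node in the tree, which is often the whitespace text between two tags rather than the next <li>. A small sketch of the difference, using inline sample markup:

```python
from bs4 import BeautifulSoup

html = '<ul><li class="active"><a href="a.com">A</a></li>\n<li><a href="b.com">B</a></li></ul>'
soup = BeautifulSoup(html, "html.parser")
page = soup.find("li", {"class": "active"})

# .next_sibling returns the next node of any kind -- here, the
# newline between the two <li> tags, not the next element
print(repr(page.next_sibling))  # '\n'

# find_next_sibling() skips non-element nodes and returns the next tag
sibling = page.find_next_sibling("li")
print(sibling.find("a")["href"])  # b.com
```

This is why page.next_sibling.get("href") from the question fails: the whitespace node has no get() method.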
answered Oct 11 '22 by L3viathan