I am completely new to web parsing with Python/BeautifulSoup. I have an HTML page that contains (in part) the following code:
<div id="pages">
<ul>
<li class="active"><a href="example.com">Example</a></li>
<li><a href="example.com">Example</a></li>
<li><a href="example1.com">Example 1</a></li>
<li><a href="example2.com">Example 2</a></li>
</ul>
</div>
I have to visit each link (basically each <li> element) until there are no more <li> tags present. Each time a link is clicked, its corresponding <li> element gets the class 'active'. My code is:
from bs4 import BeautifulSoup
import urllib2

landingPage = urllib2.urlopen('http://somepage.com').read()
soup = BeautifulSoup(landingPage)

# the <div id="pages"> wrapping the pagination list
pageList = soup.find("div", {"id": "pages"})
# the currently active <li>
page = pageList.find("li", {"class": "active"})
This code gives me the first <li> item in the list. My logic is to keep checking whether the next_sibling is None; as long as it is not, I make an HTTP request to the href attribute of the <a> tag in that sibling <li>. That would get me to the next page, and so on, until there are no more pages. But I can't figure out how to get the next_sibling of the page variable above. Is it page.next_sibling.get("href") or something like that? I looked through the documentation, but somehow couldn't find it. Can someone please help?
Use find_next_sibling() and be explicit about which sibling element you want to find:
next_li_element = page.find_next_sibling("li")
next_li_element would be None if page corresponds to the last li in the list:
if next_li_element is None:
# no more pages to go
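Tying that into the loop you described, the whole crawl could look roughly like this. This is a sketch only, not run against your real site: 'http://somepage.com' stands in for your URL, urlparse.urljoin is there in case the href values are relative, and it assumes every page you land on contains the same <div id="pages"> block:
from bs4 import BeautifulSoup
import urllib2
import urlparse

base_url = 'http://somepage.com'  # placeholder, as in your snippet
landingPage = urllib2.urlopen(base_url).read()
soup = BeautifulSoup(landingPage)

pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})

while True:
    next_li_element = page.find_next_sibling("li")
    if next_li_element is None:
        break  # no more pages to go

    # follow the link inside the next <li>
    next_url = urlparse.urljoin(base_url, next_li_element.find("a").get("href"))
    soup = BeautifulSoup(urllib2.urlopen(next_url).read())

    # on the freshly loaded page, the <li> we just followed should be the active one
    pageList = soup.find("div", {"id": "pages"})
    page = pageList.find("li", {"class": "active"})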
Have you looked at dir(page) or the documentation? If so, how did you miss .find_next_sibling()?
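For what it's worth, a quick way to spot it is to filter dir() for anything sibling-related (the exact listing depends on your bs4 version):
# every attribute on the tag whose name mentions "sibling"
print [name for name in dir(page) if 'sibling' in name]
# on a recent bs4 this should include find_next_sibling / find_next_siblings,
# next_sibling / next_siblings, and the matching "previous" variants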
from bs4 import BeautifulSoup
import urllib2

landingPage = urllib2.urlopen('http://somepage.com').read()
soup = BeautifulSoup(landingPage)

pageList = soup.find("div", {"id": "pages"})
page = pageList.find("li", {"class": "active"})

# the <li> right after the active one (None when the active one is last)
sibling = page.find_next_sibling()
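From there, the link for the next request is just the href of the <a> inside that sibling; a minimal follow-up, assuming the href values on your real page are absolute URLs:
if sibling is not None:
    next_href = sibling.find("a").get("href")
    print next_href  # feed this to urllib2.urlopen() to load the next page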