I am trying to parse https://www.tandfonline.com/toc/icbi20/current for the titles of all articles. The HTML
is divided into Volumes and Issues. Each Volume has one Issue per month, so for Volume 36 there would be 12 Issues. In the current Volume (37) there are 4 Issues, and I would like to parse through each Issue and get each Article's name.
To accomplish this and automate the search, I need to fetch the href links for each Issue. Initially I targeted the parent div's id: id='tocList'.
import requests
from bs4 import BeautifulSoup, SoupStrainer

# Fetch the table of contents and keep only the element with id="tocList"
chronobiology = requests.get("https://www.tandfonline.com/toc/icbi20/current")
chrono_coverpage = chronobiology.content

issues = SoupStrainer(id='tocList')
issues_soup = BeautifulSoup(chrono_coverpage, 'html.parser', parse_only=issues)

for issue in issues_soup:
    print(issue)
This returns a bs4 object, BUT only with the href links from the Volume div. What's worse, this div should encompass both the Volume div and the Issue div.
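(A quick way to confirm exactly which links the strainer kept, assuming issues_soup from the snippet above:)

for a in issues_soup.find_all('a', href=True):
    print(a['href'])   # only the Volume-level links show up here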
So I decided to reduce my search space and make it more specific by choosing the div containing the Issue href links (class_='issues').
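For completeness, a minimal sketch of that narrowed search, reusing chrono_coverpage from above:

issues = SoupStrainer(class_='issues')
issues_soup = BeautifulSoup(chrono_coverpage, 'html.parser', parse_only=issues)

print(issues_soup)          # comes back blank
for element in issues_soup:
    print(type(element))    # the only children are strings, not tags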
This time Jupyter will think for a bit but won't return ANYTHING. Just blank. Nothing. Zippo. BUT if I ask what type of "nothing" has been returned, Jupyter informs me it is a "String"??? I just don't know what to make of this.
So, firstly I had a question: why does the Issue div element not respond to the parsing?
When I run print(BeautifulSoup(chrono_coverpage, 'html.parser').prettify()), the same thing occurs: the Issue div does not appear (yet when I Inspect Element on the HTML page, it sits immediately beneath the final Volume span).
So I suspect it must be JavaScript-driven or something, not so much plain HTML. Or maybe the class='open' has something to do with it.
Any clarifications would be kindly appreciated. Also, how would one parse through JavaScript-generated links to get them?
Okay, so I've "resolved" the issue, though I need to fill in some theoretical gaps:
Firstly, this snippet holds the key to the beginning of the answer:
As can be seen, the <div class='container'> is immediately followed by a ::before pseudo-element, and the links I am interested in are contained inside a div immediately beneath this pseudo-element. This last div is then closed off with the ::after pseudo-element.
Firstly, I realized that my problem was that I needed to select a pseudo-element. I found this to be quite impossible with BeautifulSoup's soup.select(), since apparently BeautifulSoup uses Soup Sieve, which "aims to allow users to target XML/HTML elements with CSS selectors. It implements many pseudo-classes [...]."
The last part of the paragraph states:
"Soup Sieve also will not match anything for pseudo classes that are only relevant in a live, browser environment, but it will gracefully handle them if they've been implemented;"
So this got me thinking that I have no idea what "pseudo classes that are only relevant in a live browser environment" means. But then I said to myself, "it also said that, had they been implemented, BS4 should be able to parse them". And since I can definitely see the div elements containing my href links of interest using the Inspect tool, I thought they must be implemented.
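(For illustration, a minimal sketch of what happens when you hand soup.select() a pseudo-element; in the Soup Sieve versions I've used, this raises NotImplementedError rather than silently matching nothing:)

from bs4 import BeautifulSoup

soup = BeautifulSoup("<div class='container'></div>", 'html.parser')
try:
    soup.select('div.container::before')
except NotImplementedError as error:
    # Pseudo-elements live in CSS rendering, not in the parsed DOM tree
    print(error)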
The first part of that phrase got me thinking: "But do I need a live browser for this to work?" So that brought me to Selenium's webdriver:
from bs4 import BeautifulSoup
from selenium import webdriver

# Load the page in a real browser so any JavaScript gets a chance to run
driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")

chronobiology_content = driver.page_source
chronobiology_soup = BeautifulSoup(chronobiology_content, 'html.parser')
chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]: []
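(In hindsight, one way to probe this kind of failure is to wait explicitly for the element before reading page_source; a sketch using Selenium's WebDriverWait, with the CSS selector borrowed from above. If the issues list only enters the DOM after some interaction, this wait will simply time out:)

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")

# Wait up to 10 seconds for the issues list to be attached to the DOM;
# a TimeoutException here suggests it is only added after an interaction
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, '#tocList div.issues div'))
)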
Clearly that empty list made me sad, because I thought I had understood what was going on. But then I thought that if I 'clicked' one of the Issues in the previously opened browser, it would work (for some reason; to be honest, I'm pretty sure desperation led me to that thought).
Well, surprise surprise, it worked: after clicking on "Issue 4" and re-running the script, I got what I was looking for:
UNANSWERED QUESTIONS
1 - Apparently these pseudo-elements only "exist" when clicked upon, because otherwise the code doesn't recognize they are there. Why?
2 - What code must be run in order to make an initial click and activate these pseudo-elements, so the code can automatically open these links and parse the information I want (the titles of articles)?
UPDATE
Question 2 is answered using Selenium's ActionChains:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains

driver = webdriver.Chrome()
driver.get("https://www.tandfonline.com/toc/icbi20/current")

# Hover over the issues scroller so the page's JavaScript renders the links
action = ActionChains(driver)
action.move_to_element(driver.find_element_by_xpath('//*[@id="tocList"]/div/div/div[3]/div[2]/div')).perform()

# Grab the page source only AFTER the hover, so the rendered links are in it
chronobiology_soup = BeautifulSoup(driver.page_source, 'html.parser')
chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
[Out]:
[<div class="loi-issues-scroller">
<a class="open" href="/toc/icbi20/37/4?nav=tocList">Issue<span>4</span></a>
<a class="" href="/toc/icbi20/37/3?nav=tocList">Issue<span>3</span></a>
<a class="" href="/toc/icbi20/37/2?nav=tocList">Issue<span>2</span></a>
<a class="" href="/toc/icbi20/37/1?nav=tocList">Issue<span>1</span></a>
</div>]
The only downside is that one must stay on the page so that Selenium's ActionChains.perform() can actually interact with the element, but at least I've automated this step.
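From here, closing the loop on the original goal is straightforward in outline: visit each Issue href and collect the article titles. A sketch of that last step; note that '.art_title' is my assumption about tandfonline's markup, to be confirmed with Inspect Element, not something established above:

# Collect the Issue links found by the select() call above
scroller = chronobiology_soup.select('#tocList > div > div > div.yearContent > div.issues > div > div')
issue_urls = ['https://www.tandfonline.com' + a['href']
              for a in scroller[0].find_all('a', href=True)]

for url in issue_urls:
    driver.get(url)
    issue_soup = BeautifulSoup(driver.page_source, 'html.parser')
    # '.art_title' is a guess at the class on each article-title element;
    # verify the real class name in DevTools before relying on it
    for title in issue_soup.select('.art_title'):
        print(title.get_text(strip=True))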
If someone could answer question 1, that would be great.