I am trying to scrape the following webpage: https://www.scoreboard.com/uk/football/england/premier-league/,
specifically the scheduled and finished results. To do that, I am looking for the elements with class "stage-finished" or "stage-scheduled".
However, when I scrape the page and print what page_soup contains, these elements are not there.
I found another SO question with an answer saying that this happens because the content is loaded via AJAX, and that I need to look at the XHR requests under the Network tab in Chrome DevTools to find the file that loads the data. However, it doesn't seem to be there.
import bs4
import requests
from bs4 import BeautifulSoup as soup
import csv
import datetime
myurl = "https://www.scoreboard.com/uk/football/england/premier-league/"
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = requests.get(myurl, headers=headers)
page_soup = soup(page.content, "html.parser")
scheduled = page_soup.select(".stage-scheduled")
finished = page_soup.select(".stage-finished")
live = page_soup.select(".stage-live")
print(page_soup)
print(scheduled[0])
The code above raises an IndexError, of course, because the scheduled list is empty.
My question is, how do I go about getting the data I'm looking for?
I copied the contents of the XHR responses into Notepad and searched for stage-finished and the other class names, but found nothing. Am I missing something easy here?
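One quick way to confirm that the markup really is missing from the static HTML, rather than hidden by a parsing problem, is to list the class names in the raw response. The sketch below uses only the standard library. The `sample` string and its `fs-table` class are hypothetical stand-ins for `page.text`, not the site's real markup.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in a static HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in_html(html):
    """Return the set of class names present in the raw (unrendered) markup."""
    collector = ClassCollector()
    collector.feed(html)
    collector.close()
    return collector.classes


# Hypothetical stand-in for page.text: an empty shell that a script fills in later.
sample = '<div id="fs" class="fs-table"></div><script src="app.js"></script>'
found = classes_in_html(sample)
print("stage-scheduled" in found)
```

If the class is absent from the raw response but visible in the browser's element inspector, the rows are most likely being created by JavaScript after the page loads.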
BeautifulSoup's find_all() method is one of its most commonly used: it searches a tag's descendants and returns every element that matches the given filter, for example all heading tags.
If nothing matches, find_all() and select() return an empty list, while find() and attribute-style access such as soup.h1 return None.
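Because select() and find_all() return an empty list when nothing matches, indexing the result with [0] raises IndexError. A small guard avoids that. This is a minimal sketch: `first_or_none` is a hypothetical helper name, and plain lists stand in for the result lists BeautifulSoup would return.

```python
def first_or_none(results):
    """Return the first match, or None when the result list is empty."""
    return results[0] if results else None


# Plain lists stand in for what select() / find_all() would return.
no_matches = []
some_matches = ["<tr class='stage-scheduled'>", "<tr class='stage-finished'>"]

print(first_or_none(no_matches))
print(first_or_none(some_matches))
```

In the question's code this would look like `first_or_none(page_soup.select(".stage-scheduled"))`, followed by a check for None before the value is used.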
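The page is rendered with JavaScript, so the HTML that requests receives does not contain those elements. One option is to drive a real browser with Selenium. Here is some code to start with: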
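from selenium import webdriver
from selenium.webdriver.common.by import By
url = 'https://www.scoreboard.com/uk/football/england/premier-league/'
driver = webdriver.Chrome()
driver.get(url)
# find_elements_by_class_name is gone from recent Selenium 4 releases; use find_elements with By
stages = driver.find_elements(By.CLASS_NAME, 'stage-scheduled')
# quit() ends the whole browser session; close() only closes the current window
driver.quit()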
Alternatively, pass driver.page_source to the BeautifulSoup constructor before closing the driver, like this:
soup = BeautifulSoup(driver.page_source, 'html.parser')
Note: You need to install a webdriver first. I installed chromedriver.
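Script-built rows may not exist yet at the moment driver.get() returns, so it can help to wait for them before reading the page source. The sketch below uses a hand-rolled polling loop as a stand-in for Selenium's own WebDriverWait, so the helper works with any object exposing get, find_elements, page_source and quit. The names `rendered_html` and `main`, the timeout values, and the assumption that the site still uses the `stage-scheduled` class are illustrative, not confirmed by the page.

```python
import time


def rendered_html(driver, url, css_selector, timeout=10.0, poll=0.5):
    """Load url, poll until css_selector matches something, return the rendered HTML.

    Raises TimeoutError if nothing matches within `timeout` seconds.
    The driver is always quit before this function returns or raises.
    """
    try:
        driver.get(url)
        deadline = time.monotonic() + timeout
        while True:
            # "css selector" is the locator strategy string behind By.CSS_SELECTOR.
            if driver.find_elements("css selector", css_selector):
                return driver.page_source
            if time.monotonic() >= deadline:
                raise TimeoutError(f"{css_selector!r} did not appear within {timeout}s")
            time.sleep(poll)
    finally:
        driver.quit()


def main():
    # Third-party imports stay local so the helper above has no hard dependencies.
    from bs4 import BeautifulSoup
    from selenium import webdriver

    url = "https://www.scoreboard.com/uk/football/england/premier-league/"
    html = rendered_html(webdriver.Chrome(), url, ".stage-scheduled")
    page_soup = BeautifulSoup(html, "html.parser")
    print(len(page_soup.select(".stage-scheduled")))
```

Calling main() opens Chrome, waits up to ten seconds for a scheduled row to appear, and then hands the rendered HTML to BeautifulSoup so the original select() calls can be reused unchanged.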
Good luck!