BeautifulSoup can't find class that exists on webpage?

I am trying to scrape the following webpage: https://www.scoreboard.com/uk/football/england/premier-league/

Specifically, I want the scheduled and finished results, so I am looking for elements with class "stage-finished" or "stage-scheduled". However, when I scrape the webpage and print out page_soup, it doesn't contain these elements.

I found another SO question with an answer saying that this happens because the content is loaded via AJAX, and that I need to look at the XHR requests under the Network tab in Chrome DevTools to find the file that's loading the necessary data. However, it doesn't seem to be there.

import bs4
import requests
from bs4 import BeautifulSoup as soup
import csv
import datetime

myurl = "https://www.scoreboard.com/uk/football/england/premier-league/"
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = requests.get(myurl, headers=headers)

page_soup = soup(page.content, "html.parser")

scheduled = page_soup.select(".stage-scheduled")
finished = page_soup.select(".stage-finished")
live = page_soup.select(".stage-live")
print(page_soup)
print(scheduled[0])

The above code throws an error, of course (an IndexError on scheduled[0]), because the scheduled list is empty.

My question is, how do I go about getting the data I'm looking for?

I copied the contents of the XHR files to a notepad and searched for stage-finished and other tags and found nothing. Am I missing something easy here?
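One way to check whether the classes are present in the HTML the server actually returns, before any JavaScript runs, is to list the class names found in the raw response. The following is only a sketch using the standard library; ClassCollector, classes_in_html, missing_classes and the sample string are hypothetical names and data made up for illustration, not part of the original code.

```python
from html.parser import HTMLParser


class ClassCollector(HTMLParser):
    """Collect every CSS class name that appears in an HTML document."""

    def __init__(self):
        super().__init__()
        self.classes = set()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        for name, value in attrs:
            if name == "class" and value:
                self.classes.update(value.split())


def classes_in_html(html):
    """Return the set of class names used anywhere in the given HTML text."""
    collector = ClassCollector()
    collector.feed(html)
    return collector.classes


def missing_classes(html, wanted):
    """Return the wanted class names that never occur in the static HTML."""
    return sorted(set(wanted) - classes_in_html(html))


# Hypothetical stand-in for a page whose rows are filled in later by JavaScript.
sample = '<div id="fs" class="fs-table"><script>render()</script></div>'
print(missing_classes(sample, ["stage-scheduled", "stage-finished"]))
```

With the code from the question, the same check could be run on page.text. If every wanted class is reported as missing, the elements are probably created in the browser after the page loads, which would explain why select() returns an empty list.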

asked Sep 19 '18 14:09 by Danny



1 Answer

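The page is rendered with JavaScript, so the HTML that requests receives does not contain those elements. You need something that drives a real browser, such as Selenium. Here is some code to start with: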

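from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'https://www.scoreboard.com/uk/football/england/premier-league/'

driver = webdriver.Chrome()
driver.get(url)
# find_elements_by_class_name was removed in newer Selenium 4 releases;
# find_elements with a By locator is the current form.
stages = driver.find_elements(By.CLASS_NAME, 'stage-scheduled')
driver.quit()  # end the browser session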

Or you could pass driver.page_source to BeautifulSoup (before closing the driver), like this:

soup = BeautifulSoup(driver.page_source, 'html.parser')

Note: You need to install a webdriver first. I installed chromedriver.
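Because the rows are added by JavaScript after the initial load, they may not exist yet if the lookup runs immediately after driver.get(). Below is a rough sketch, assuming Selenium 4, Chrome and BeautifulSoup are installed, that waits explicitly until at least one matching element is present and then hands the rendered HTML to BeautifulSoup. The names build_stage_selector, fetch_rendered_html and main are hypothetical helpers, and the 15 second timeout is an arbitrary choice.

```python
def build_stage_selector(stages=("scheduled", "finished", "live")):
    """Return a CSS selector group such as '.stage-scheduled, .stage-finished'."""
    return ", ".join(f".stage-{name}" for name in stages)


def fetch_rendered_html(url, selector, timeout=15):
    """Load url in Chrome, wait until selector matches, return the rendered HTML."""
    # Imported lazily so build_stage_selector works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome()
    try:
        driver.get(url)
        # Raises TimeoutException if nothing matches within `timeout` seconds.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, selector))
        )
        return driver.page_source
    finally:
        # Always end the browser session, even if the wait times out.
        driver.quit()


def main():
    """Print the text of each scheduled row (needs a browser and network access)."""
    from bs4 import BeautifulSoup

    url = "https://www.scoreboard.com/uk/football/england/premier-league/"
    html = fetch_rendered_html(url, build_stage_selector())
    page_soup = BeautifulSoup(html, "html.parser")
    for row in page_soup.select(build_stage_selector(("scheduled",))):
        print(row.get_text(" ", strip=True))
```

Calling main() runs the whole flow. Waiting on a selector group means the wait ends as soon as any scheduled, finished or live row appears, so the sketch does not depend on a particular kind of match being listed that day.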

Good luck!

answered Nov 12 '22 06:11 by teller.py3