I am currently trying to practice with the requests and BeautifulSoup Modules in Python 3.6 and have run into an issue that I can't seem to find any info on in other questions and answers.
It seems that at some point in the page, Beuatiful Soup stops recognizing tags and Ids. I am trying to pull Play-by-play data from a page like this:
http://www.pro-football-reference.com/boxscores/201609080den.htm
import requests, bs4
source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
raise Exception('No data found for this link: '+source_url)
soup = bs4.BeautifulSoup(res.text,'html.parser')
#this works
all_pbp = soup.findAll('div', {'id' : 'all_pbp'})
print(len(all_pbp))
#this doesn't
table = soup.findAll('table', {'id' : 'pbp'})
print(len(table))
Using the inspector in Chrome, I can see that the table definitely exists. I have also tried to use it on 'div's and 'tr's in the later half of the HTML and it doesn't seem to work. I have tried the standard 'html.parser' as well as lxml and html5lib, but nothing seems to work.
Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from correctly finding the later tags? I have run into issues with similar pages run by this company (hockey-reference.com, basketball-reference.com), but have been able to use these tools properly on other sites.
If it is something with the HTML, is there any better tool/library for helping to extract this info out there?
Thank you for your help, BF
BS4 won't be able to execute the javascript of a web page after doing the GET request for a URL. I think that the table of concern is loaded async from client-side javascript.
As a result, the client-side javascript will need to run first before scraping the HTML. This post describes how to do so!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With