Beautiful Soup Can't Find Tags

I am practicing with the requests and BeautifulSoup modules in Python 3.6 and have run into an issue that I can't find any information on in other questions and answers.

It seems that at some point in the page, Beautiful Soup stops recognizing tags and ids. I am trying to pull play-by-play data from a page like this:

http://www.pro-football-reference.com/boxscores/201609080den.htm

import requests, bs4

source_url = 'http://www.pro-football-reference.com/boxscores/201609080den.htm'
res = requests.get(source_url)
if '404' in res.url:
    raise Exception('No data found for this link: '+source_url)

soup = bs4.BeautifulSoup(res.text,'html.parser')

#this works
all_pbp = soup.findAll('div', {'id' : 'all_pbp'})
print(len(all_pbp))

#this doesn't
table = soup.findAll('table', {'id' : 'pbp'})
print(len(table))

Using the inspector in Chrome, I can see that the table definitely exists. I have also tried searching for 'div's and 'tr's in the latter half of the HTML, and that doesn't work either. I have tried the standard 'html.parser' as well as lxml and html5lib, with the same result each time.

Am I doing something wrong here, or is there something in the HTML or its formatting that prevents BeautifulSoup from correctly finding the later tags? I have run into issues with similar pages run by this company (hockey-reference.com, basketball-reference.com), but have been able to use these tools properly on other sites.

If it is something with the HTML, is there a better tool or library for extracting this info?

Thank you for your help, BF

Big Fore asked Jul 02 '17 04:07


1 Answer

BS4 does not execute a page's JavaScript; it only parses the HTML that comes back from the GET request. I think the table in question is loaded asynchronously by client-side JavaScript.

As a result, the client-side JavaScript needs to run before the HTML is scraped. This post describes how to do so!
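Before bringing in a browser, it may be worth checking that guess against the raw response. The sketch below uses only the standard library to report whether an id appears in live markup, only inside an HTML comment, or not at all. `IdLocator`, `locate_id`, and `SAMPLE_HTML` are hypothetical names, and the sample markup is an assumption modeled on the ids in the question, not a copy of the real page.

```python
from html.parser import HTMLParser


class IdLocator(HTMLParser):
    """Collect element ids, separating live markup from markup inside comments."""

    def __init__(self, in_comment=False):
        super().__init__()
        self.in_comment = in_comment
        self.live_ids = set()
        self.commented_ids = set()

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id:
            target = self.commented_ids if self.in_comment else self.live_ids
            target.add(element_id)

    def handle_comment(self, data):
        # Re-parse the comment body as HTML to see whether it hides real tags.
        inner = IdLocator(in_comment=True)
        inner.feed(data)
        inner.close()
        self.commented_ids |= inner.commented_ids


def locate_id(html, element_id):
    """Return 'live', 'commented', or 'absent' for the given element id."""
    locator = IdLocator()
    locator.feed(html)
    locator.close()
    if element_id in locator.live_ids:
        return "live"
    if element_id in locator.commented_ids:
        return "commented"
    return "absent"


# Hypothetical markup: the wrapper div is live, the table sits in a comment.
SAMPLE_HTML = """
<div id="all_pbp">
  <!--
  <table id="pbp"><tr><td>play 1</td></tr></table>
  -->
</div>
"""

for wanted in ("all_pbp", "pbp", "missing"):
    print(wanted, "->", locate_id(SAMPLE_HTML, wanted))
```

In practice `res.text` from the question would be passed instead of `SAMPLE_HTML`. An "absent" result is consistent with the table being fetched by JavaScript after load. A "commented" result would mean the markup is already in the response, in which case Beautiful Soup can reach it by collecting `bs4.Comment` nodes and parsing their text as a second soup, with no JavaScript engine needed.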

qwertyuip9 answered Oct 13 '22 00:10