I am trying to scrape data from a table of sporting statistics presented as HTML, using the BeautifulSoup and requests libraries on Python 3.5. I seem to be obtaining the HTML successfully via requests, because when I display r.content the full HTML of the website I am trying to scrape is shown. However, when I pass this to BeautifulSoup, it drops the bulk of the HTML, namely the tables of statistics themselves.
If you take a look at the website in question, the HTML from "Scoring Progression" onward is dropped.
I think the problem relates to the pieces of HTML which are included between brackets ('[' and ']') but I have not been able to develop a workaround. I have tried the html, lxml and html5lib parsers for BeautifulSoup, to no avail. I have also tried providing 'User-Agent' headers and that did not work either.
My code is as below. For brevity's sake I have not included the output.
import requests
from bs4 import BeautifulSoup
r = requests.get('http://afltables.com/afl/stats/games/2015/031420150402.html')
soup = BeautifulSoup(r.content, 'html5lib')
print(soup)
I used a different parser, Python's built-in html.parser, and it seemed to work:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'http://afltables.com/afl/stats/games/2015/031420150402.html'
with urlopen(url) as response:  # fetch the page and close the connection afterwards
    html = response.read()
soup = BeautifulSoup(html, 'html.parser')  # parse with Python's built-in HTML parser
tables = soup.find_all('table')  # collect every table on the page
print(tables[7])  # the scoring progression table (the eighth table, index 7)
Note that if you had tried something like "soup.table" instead of "find_all", it would look as though the other tables had been dropped, because "soup.table" returns only the first table.
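To make the difference concrete, here is a minimal sketch that runs without any network access. The HTML string and the table ids are made up for illustration and stand in for the real page; only the BeautifulSoup calls themselves mirror the code above.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: three small tables.
html = """
<html><body>
<table id="summary"><tr><td>Summary</td></tr></table>
<table id="players"><tr><td>Players</td></tr></table>
<table id="scoring"><tr><td>Scoring Progression</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.table is a shortcut for soup.find("table"): it returns the first match only.
first_table = soup.table

# find_all returns every <table>, in document order.
all_tables = soup.find_all("table")

print(first_table["id"])                    # id of the first table
print(len(all_tables))                      # number of tables found
print(all_tables[-1].get_text(strip=True))  # text of the last table
```

If the count printed by len(all_tables) matches the number of tables you expect, the parser has kept everything and the problem lies in how the tables are being selected.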