
BeautifulSoup not reading entire HTML obtained by requests

I am trying to scrape data from an HTML table of sporting statistics using the BeautifulSoup and requests libraries, both running on Python 3.5. requests appears to fetch the page successfully: when I display r.content, I see the full HTML of the website I am trying to scrape. However, when I pass this to BeautifulSoup, it drops the bulk of the HTML, namely the tables of statistics themselves.

If you take a look at the website in question, the HTML from "Scoring Progression" onward is dropped.

I think the problem relates to the pieces of HTML which are included between brackets ('[' and ']') but I have not been able to develop a workaround. I have tried the html, lxml and html5lib parsers for BeautifulSoup, to no avail. I have also tried providing 'User-Agent' headers and that did not work either.

My code is below. For brevity I have not included the output.

import requests
from bs4 import BeautifulSoup

r = requests.get('http://afltables.com/afl/stats/games/2015/031420150402.html')

soup = BeautifulSoup(r.content, 'html5lib')

print(soup)
nijawa asked Nov 08 '22 18:11


1 Answer

I used a different parser, Python's built-in html.parser, and it seemed to work.

from bs4 import BeautifulSoup
from urllib.request import urlopen as uReq

url = 'http://afltables.com/afl/stats/games/2015/031420150402.html'
client = uReq(url)  # grabs the page
soup = BeautifulSoup(client.read(), 'html.parser')  # using the default html parser
tables = soup.find_all('table')  # gets all the tables
print(tables[7])  # scoring progression table: the 8th table (index 7)

Note that if you had tried something like soup.table without calling find_all first, it would look as though the other tables had been dropped, because soup.table returns only the first table.
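To make the difference concrete, here is a minimal sketch that runs without network access. The HTML string is a hypothetical stand-in for the real page (its table contents and header names are invented for illustration), and looking a table up by its header text is only one possible alternative to a hard-coded index, not something the page is known to support as written.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real page: three small tables.
html = """
<html><body>
<table><tr><th>Round</th></tr><tr><td>1</td></tr></table>
<table><tr><th>Player Stats</th></tr><tr><td>42</td></tr></table>
<table><tr><th>Scoring Progression</th></tr><tr><td>goal</td></tr></table>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# soup.table is shorthand for soup.find("table"): first match only.
first = soup.table
first_header = first.th.get_text()

# find_all returns every <table>, in document order.
tables = soup.find_all("table")
table_count = len(tables)

# Select a table by its header text instead of a fixed index.
# Assumes the header cell contains exactly this text (hypothetical here).
scoring = next(
    t for t in tables
    if t.find("th", string="Scoring Progression") is not None
)
scoring_cell = scoring.td.get_text()

print(first_header, table_count, scoring_cell)
```

If the header text on the real page differs, the next() call raises StopIteration, which is arguably easier to notice than silently printing the wrong table from a stale index.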

Data Science Dojo answered Nov 14 '22 23:11