So I am trying to scrape a table from a specific website using BeautifulSoup and urllib. My goal is to create a single list from all the data in this table. I have tried using this same code using tables from other websites, and it works fine. However, while trying it with this website the table returns a NoneType object. Can someone help me with this? I've tried looking for other answers online but I'm not having much luck.
Here's the code:
import requests
import urllib
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib.request.urlopen("http://www.teamrankings.com/ncaa-basketball/stat/free-throw-pct").read())
table = soup.find("table", attrs={'class':'sortable'})
data = []
rows = table.findAll("tr")
for tr in rows:
cols = tr.findAll("td")
for td in cols:
text = ''.join(td.find(text=True))
data.append(text)
print(data)
Beautiful Soup is a very powerful library that makes web scraping by traversing the DOM (document object model) easier to implement. But it does only static scraping. Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser.
For web scraping to work in Python, we're going to perform three basic steps: Extract the HTML content using the requests library. Analyze the HTML structure and identify the tags which have our content. Extract the tags using Beautiful Soup and put the data in a Python list.
First, install Cheerio and Axios by running the following command: npm install cheerio axios . Then create a new file called crawler. js and copy/paste the following code: const axios = require('axios'); const cheerio = require('cheerio'); const getPostTitles = async () => { try { const { data } = await axios.
Alternatively, we could also use BeautifulSoup on the rendered HTML (see below). However, the awesome point here is that we can create the connection to this webpage, render its JavaScript, and parse out the resultant HTML all in one package!
It looks like this data is loaded via an ajax call:
You should target that url instead: http://www.teamrankings.com/ajax/league/v3/stats_controller.php
import requests
import urllib
from bs4 import BeautifulSoup
params = {
"type":"team-detail",
"league":"ncb",
"stat_id":"3083",
"season_id":"312",
"cat_type":"2",
"view":"stats_v1",
"is_previous":"0",
"date":"04/06/2015"
}
content = urllib.request.urlopen("http://www.teamrankings.com/ajax/league/v3/stats_controller.php",data=urllib.parse.urlencode(params).encode('utf8')).read()
soup = BeautifulSoup(content)
table = soup.find("table", attrs={'class':'sortable'})
data = []
rows = table.findAll("tr")
for tr in rows:
cols = tr.findAll("td")
for td in cols:
text = ''.join(td.find(text=True))
data.append(text)
print(data)
Using your web inspector you can also view the parameters that are passed along with the POST request.
Generally the server on the other end will check for these values and reject your request if you do not have some or all of them. The above code snippet ran fine for me. I switched to urllib2
because I generally prefer to use that library.
If the data loads in your browser it is possible to scrape it. You just need to mimic the request your browser sends.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With