In the code snippet below, you can see that I am trying to scrape some data from the NCAA Men's Basketball website.
import requests
url = "https://www.ncaa.com/scoreboard/basketball-men/d1/"
response = requests.get(url)
html = response.text
print(html)
print(response.headers)
print("\n\n")
print(response.request.headers)
The website has a listing of games and their scores. I figured out how to pull all the data I need using Python Requests for the HTTP request and then BeautifulSoup for extracting data from the HTML. The full scraper is here if you'd like to take a look.
The problem: When Requests gets the response from the NCAA website, the data is much older (sometimes by 30 or 40 minutes, or more) than the data on the actual website.
I've been Googling this for hours. After reading through the Python Requests docs, I believe I have discovered that the NCAA web server is sending outdated data. But I don't understand why it would send my program outdated data when it sends Google Chrome (or whatever web browser) the correct data.
The reason I believe the server is sending outdated data is that when I print the response headers, one of the items is 'Last-Modified': 'Sat, 26 Jan 2019 17:49:13 GMT' while another is 'Date': 'Sat, 26 Jan 2019 18:20:29 GMT', so it looks like the server gets the request at the right time, but provides data that hasn't been modified in a while.
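To put a number on that gap, the two headers can be compared directly. This is only a minimal sketch: `response_staleness` is a hypothetical helper name, and the sample values are the ones quoted above. With a real response you would pass `response.headers` instead of the sample dictionary.

```python
from email.utils import parsedate_to_datetime


def response_staleness(headers):
    """Return Date minus Last-Modified as a timedelta, or None if either header is missing."""
    date = headers.get("Date")
    last_modified = headers.get("Last-Modified")
    if date is None or last_modified is None:
        return None
    # Both values use the standard HTTP date format, which the stdlib can parse.
    return parsedate_to_datetime(date) - parsedate_to_datetime(last_modified)


# Header values copied from the response described above.
sample_headers = {
    "Last-Modified": "Sat, 26 Jan 2019 17:49:13 GMT",
    "Date": "Sat, 26 Jan 2019 18:20:29 GMT",
}

staleness = response_staleness(sample_headers)
print(staleness)  # 0:31:16
```

For these two headers the difference is a little over 31 minutes, which matches the delay I am seeing in the scores.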
My question: Do you know of any reason why this would happen? Is there something I need to add to my HTTP request that would get the server to send me data consistent with what it sends web browsers?
P.S. I am so sorry for the long question. I tried to keep it concise, yet still explain things clearly.
Before your requests.get() call, try adding a header:
import requests
url = "https://www.ncaa.com/scoreboard/basketball-men/d1/"
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
response = requests.get(url, headers = headers)
html = response.text
My other suggestion would be to use:
url = 'https://data.ncaa.com/casablanca/scoreboard/basketball-men/d1/2019/01/26/scoreboard.json'
and use the json package to read it. Everything is live and right there for you in a nice JSON format.
Code
import json
import requests
url = 'https://data.ncaa.com/casablanca/scoreboard/basketball-men/d1/2019/01/26/scoreboard.json'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
response = requests.get(url, headers = headers)
jsonStr = response.text
jsonObj = json.loads(jsonStr)
I checked, and the JSON object does return live scores/data. All you need to do is change the date in the URL (2019/01/26) to get the finished game data for previous dates.
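If you want to avoid editing the URL by hand, the date part can be filled in from a `datetime.date`. This is just a sketch that assumes the URL pattern stays the same as in the example above; `scoreboard_url` is a name I made up for illustration.

```python
from datetime import date

# Base path taken from the URL above; assumed to stay the same for other dates.
BASE = "https://data.ncaa.com/casablanca/scoreboard/basketball-men/d1"


def scoreboard_url(day):
    """Build the scoreboard JSON URL for a given date (zero-padded YYYY/MM/DD)."""
    return f"{BASE}/{day:%Y/%m/%d}/scoreboard.json"


url = scoreboard_url(date(2019, 1, 26))
print(url)
```

Passing `date.today()` instead should give the URL for the current day's games, assuming the site follows the same pattern for every date.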
EDIT - ADDITIONAL
This could help you pull out the data. Notice how I changed it to today's date to get the current data. It puts it in a nice dataframe for you:
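# pandas >= 1.0 exposes json_normalize at the top level;
# on older versions use: from pandas.io.json import json_normalize
from pandas import json_normalize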
import json
import requests
url = 'https://data.ncaa.com/casablanca/scoreboard/basketball-men/d1/2019/01/27/scoreboard.json'
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
# Thanks to InfectedDrake's suggestion, the 3 lines I previously had (commented out below) can be replaced by the single .json() call that follows them.
#response = requests.get(url, headers = headers)
#jsonStr = response.text
#jsonObj = json.loads(jsonStr)
jsonObj = requests.get(url, headers = headers).json()
result = json_normalize(jsonObj['games'])
Try changing the user-agent in the request header to make it the same as your Google Chrome user-agent by adding this to your headers:
headers = {
    'User-Agent': 'Add your google chrome user-agent here'
}