
Python Requests Module Not Getting Latest Data from Web Server

In the code snippet below, you can see that I am trying to scrape some data from the NCAA Men's Basketball website.

import requests

url = "https://www.ncaa.com/scoreboard/basketball-men/d1/"

response = requests.get(url)
html = response.text

print(html)
print(response.headers)
print("\n\n")
print(response.request.headers)

The website has a listing of games and their scores. I figured out how to pull all the data I need using Python Requests for the HTTP request and then BeautifulSoup for extracting data from the HTML. The full scraper is here if you'd like to take a look.

The problem: When Requests gets the response from the NCAA website, the data is much older (sometimes by 30 to 40 minutes or more) than the data shown on the actual website.

I've been Googling this for hours. After reading through the Python Requests docs, I believe I have discovered that the NCAA web server is sending outdated data. But I don't understand why it would send my program outdated data when it sends Google Chrome (or whatever web browser) the correct data.

The reason I believe the server is sending outdated data is that when I print the response headers, one of the items is 'Last-Modified': 'Sat, 26 Jan 2019 17:49:13 GMT' while another is 'Date': 'Sat, 26 Jan 2019 18:20:29 GMT'. So it looks like the server receives the request at the right time but returns content that was last modified about 31 minutes earlier.
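
For reference, the gap between those two headers can be computed directly. This is a minimal sketch using only the standard library; the header values are copied from the response above, and the helper name staleness is made up for illustration.

```python
from email.utils import parsedate_to_datetime

# Header values copied from the response shown above.
response_headers = {
    "Last-Modified": "Sat, 26 Jan 2019 17:49:13 GMT",
    "Date": "Sat, 26 Jan 2019 18:20:29 GMT",
}


def staleness(headers):
    """Return the time between the last modification and the response date."""
    last_modified = parsedate_to_datetime(headers["Last-Modified"])
    sent_at = parsedate_to_datetime(headers["Date"])
    return sent_at - last_modified


age = staleness(response_headers)
print(age)  # 0:31:16
```

With a live response, passing response.headers to the same function should work, since Requests exposes headers as a dict-like object.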

My question: Do you know of any reason why this would happen? Is there something I need to add to my HTTP request to get the server to send me data consistent with what it sends web browsers?

P.S. I am so sorry for the long question. I tried to keep it concise, yet still explain things clearly.

Joseph asked Feb 03 '23 20:02


2 Answers

Try passing a browser-like User-Agent header to requests.get():

import requests

url = "https://www.ncaa.com/scoreboard/basketball-men/d1/"

# Identify as a desktop Chrome browser instead of the default python-requests agent
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

response = requests.get(url, headers=headers)
html = response.text

My other suggestion would be to use:

url = 'https://data.ncaa.com/casablanca/scoreboard/basketball-men/d1/2019/01/26/scoreboard.json'

and read it with the json package. The data there is live and already in JSON format.

Code

import json
import requests

url = 'https://data.ncaa.com/casablanca/scoreboard/basketball-men/d1/2019/01/26/scoreboard.json'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

response = requests.get(url, headers=headers)

# Parse the response body into Python dicts and lists
jsonStr = response.text
jsonObj = json.loads(jsonStr)

I checked, and the JSON object does return live scores/data. To get the final results for games on an earlier date, all you need to do is change the date in the URL (2019/01/26).
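
As a sketch of that date substitution, the URL can be built from a date object instead of edited by hand. The pattern below is copied from the URL above; the names SCOREBOARD_URL and scoreboard_url are hypothetical, and it assumes the site keeps the same year/month/day layout for other dates.

```python
from datetime import date

# Pattern copied from the URL in this answer; only the date segment varies.
SCOREBOARD_URL = (
    "https://data.ncaa.com/casablanca/scoreboard/"
    "basketball-men/d1/{day:%Y/%m/%d}/scoreboard.json"
)


def scoreboard_url(day):
    """Return the scoreboard JSON URL for the given date."""
    return SCOREBOARD_URL.format(day=day)


print(scoreboard_url(date(2019, 1, 26)))
```

Passing date.today() gives the URL for the current day in your local time zone, which may not always match the date the site uses for late games.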


EDIT - ADDITIONAL

This could help you pull out the data. Notice how I changed it to today's date to get the current data. It puts it in a nice dataframe for you:

import pandas as pd
import requests

url = 'https://data.ncaa.com/casablanca/scoreboard/basketball-men/d1/2019/01/27/scoreboard.json'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}

# Thanks to InfectedDrake's wisdom, the following 3 lines that I previously had can be replaced by a single line. See below
#response = requests.get(url, headers=headers)
#jsonStr = response.text
#jsonObj = json.loads(jsonStr)

jsonObj = requests.get(url, headers=headers).json()

# json_normalize is available as pd.json_normalize since pandas 1.0;
# the older pandas.io.json.json_normalize import path is deprecated.
result = pd.json_normalize(jsonObj['games'])

chitown88 answered Feb 06 '23 14:02


Try changing the User-Agent request header so that it matches the one your Google Chrome browser sends, by passing these headers with the request:

headers = {
    'User-Agent': 'Add your google chrome user-agent here'
}
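
As a sketch of why this header might matter: unless told otherwise, Requests identifies itself as python-requests/<version>, which a server or an intermediate cache could treat differently from a browser. Whether that is what happens at ncaa.com is an assumption here, not something confirmed. The snippet sets a browser-like User-Agent on a Session and inspects the prepared request without sending anything. The User-Agent string is the example from the other answer, so substitute your own.

```python
import requests

url = "https://www.ncaa.com/scoreboard/basketball-men/d1/"

# Example Chrome User-Agent taken from the other answer; substitute your own.
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
)

# What Requests sends when no User-Agent is supplied.
default_ua = requests.utils.default_user_agent()

# Headers set on a Session apply to every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": BROWSER_UA})

# Build the request without sending it, to see the headers that would go out.
prepared = session.prepare_request(requests.Request("GET", url))

print("default :", default_ua)
print("override:", prepared.headers["User-Agent"])
```

Calling session.get(url) afterwards would send the request with the overridden header.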

s. wolfe answered Feb 06 '23 15:02