I am trying to scrape the content of a few websites. For some of them I get a response with status code 200, but for others I get a 404 status code. However, when I open the 404-returning websites in a browser, they load fine. What am I missing here?
For example:
import requests
url_1 = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
url_2 = "https://stackoverflow.com/questions/36516183/what-should-i-use-instead-of-urlopen-in-urllib3"
page_t = requests.get(url_1)
print(page_t.status_code)  # Getting a Not Found page and 404 status
page = requests.get(url_2)
print(page.status_code)  # Getting a valid HTML page and 200 status
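For reference, you can inspect the headers that were actually sent (a quick diagnostic sketch; requests identifies itself with a default User-Agent of the form 'python-requests/<version>'):
print(page_t.request.headers)  # e.g. {'User-Agent': 'python-requests/2.x', ...}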
The website you mentioned is checking for the "User-Agent" header in the request. You can fake the "User-Agent" by passing a dict of custom headers to your requests.get(..) call. That makes the request look like it is coming from an actual browser, and you'll receive a 200 response.
For example:
>>> import requests
>>> url = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
# Make request with "User-Agent" Header
>>> response = requests.get(url, headers=headers)
>>> response.status_code
200 # success response
>>> response.text # will return the website content
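If you plan to make several requests, a requests.Session can carry the header on every call. A minimal sketch, assuming the same URL and User-Agent string as above:

import requests

# A session reuses the connection and applies these headers to every request it makes
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'})

response = session.get("https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1")
print(response.status_code)  # should now print 200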
Some websites do not allow scraping by default. You need to provide a "User-Agent" header specifying the browser type and operating system, so the request looks like it comes from a browser rather than from code trying to scrape.

Use this in your code:

import requests

url = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}
response = requests.get(url, headers=headers)

See if this helps.
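To confirm the fix, you can also let requests raise an exception on a bad status. A small sketch using requests' built-in raise_for_status(), with the URL from the question:

import requests

url = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

response = requests.get(url, headers=headers)
response.raise_for_status()  # raises requests.exceptions.HTTPError for 4xx/5xx responses
print(response.status_code)  # 200 once the header is accepted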