
404 status code when making an HTTP request via Python's "requests" library, but the page loads fine in a browser

I am trying to scrape the content of a few websites. For some of them I get a response with status code 200, but for others I get a 404. When I open the sites that return 404 in a browser, they load fine. What am I missing here?

For example:

import requests

url_1 = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
url_2 = "https://stackoverflow.com/questions/36516183/what-should-i-use-instead-of-urlopen-in-urllib3"

page_t = requests.get(url_1)
print(page_t.status_code)      # Getting a Not Found page and 404 status

page = requests.get(url_2)
print(page.status_code)        # Getting a valid HTML page and 200 status
Paul Vannan asked Jan 06 '18 06:01


2 Answers

The website you mentioned checks the "User-Agent" header of the request. By default, requests identifies itself as python-requests/<version>, which this site apparently rejects. You can set a browser-like "User-Agent" by passing a dict of custom headers to your requests.get(..) call. The request then looks like it comes from an actual browser, and you receive the expected response.

For example:

>>> import requests
>>> url = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}

# Make request with "User-Agent" Header
>>> response = requests.get(url, headers=headers)
>>> response.status_code
200   # success response

>>> response.text  # will return the website content
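
To see why the server can tell the two clients apart, it may help to inspect the header that requests would send, without contacting the site at all. This is a minimal sketch; `prepared_user_agent` is a hypothetical helper name, not part of requests.

```python
import requests


def prepared_user_agent(url, headers=None):
    """Return the User-Agent requests would send for a GET, without any network call."""
    session = requests.Session()
    request = requests.Request("GET", url, headers=headers)
    # prepare_request merges the session's default headers with the custom ones.
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


url = "https://www.transfermarkt.com/jumplist/startseite/wettbewerb/GB1"
browser_headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) "
                  "Chrome/61.0.3163.100 Safari/537.36"
}

# Default header: a string beginning with "python-requests/".
print(prepared_user_agent(url))
# Custom header: the browser-like string passed in above.
print(prepared_user_agent(url, browser_headers))
```

The first line printed is the default identity that a site can filter on; the second shows that the custom header replaces it.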
Moinuddin Quadri answered Sep 28 '22 00:09


Some websites do not allow scraping, so you need to send a User-Agent header that names a browser and operating system. That makes the request look like it comes from a browser rather than from a script.

Use this in your code:

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'}

response = requests.get(url, headers=headers)

See if this helps
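
If you fetch several pages from the same site, one option is to set the header once on a requests.Session instead of passing it to every call. This is a rough sketch, not the only way to do it; `make_browser_session` and `fetch_html` are hypothetical helper names, and the 10-second timeout is an arbitrary assumption.

```python
import requests

# Same browser-like string as in the snippet above.
BROWSER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36"
)


def make_browser_session(user_agent=BROWSER_USER_AGENT):
    """Create a Session whose requests all carry the given User-Agent."""
    session = requests.Session()
    session.headers.update({"User-Agent": user_agent})
    return session


def fetch_html(session, url, timeout=10):
    """GET the URL and return its body; raise requests.HTTPError on 4xx/5xx."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text
```

Calling `fetch_html(make_browser_session(), url)` would then either return the page text or raise an exception for a 404, rather than silently handing back an error page.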

Nishant Nischal Chintalapati answered Sep 28 '22 01:09