
Check in Python if URL exists

There are quite a few questions on SO about this topic, but none of them answers the following issue. Checking a normal URL with Python requests can be done easily like so:

print requests.head('https://www.facebook.com/pixabay').status_code

A status code of 200 means the page exists. In this particular case, it's a fan page on Facebook.

Trying this with a normal user profile on Facebook can work, too:

print requests.head('https://www.facebook.com/steinberger.simon').status_code

However, there are (seemingly random) user profiles that result in a 404 status code, even though a normal browser gets a 200:

print requests.head('https://www.facebook.com/drcarl').status_code

Using a custom header with a User-Agent string or checking the URL with other methods all fail the same way:

import requests, urllib, urllib2

url = 'https://www.facebook.com/drcarl'

print requests.head(url).status_code

# using a User-Agent string
headers = { 'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36' }
print requests.head(url, headers=headers).status_code

# using GET instead of HEAD as request method
print requests.get(url, stream=True).status_code

# using urllib
print urllib.urlopen(url).getcode()

# using urllib2
try:
    r = urllib2.urlopen(url)
    print r.getcode()
except urllib2.HTTPError as e:
    print e.code

There are other examples of URLs that inexplicably fail with the above methods. One of them is this: http://www.rajivbajaj.net/ It works perfectly with a 200 status code in all browsers, but results in a 403 for all Python methods described above.

I'm trying to write a reliable URL validator, but I can't see why those URLs are failing these tests. Any ideas?

Simon Steinberger asked Oct 09 '14 08:10


2 Answers

I think the difference between the browser and the Python code is the underlying HTTP request. The Python code may fail because the HTTP request it constructs is not exactly like the one generated by a browser.

Add custom headers (using the ones you provided):

print requests.get(url, headers=headers).status_code

On my machine this returns 200 for the URL http://www.rajivbajaj.net/.

In this example, I guess the website treats certain User-Agent strings specially.
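To see whether a site really responds differently depending on the request headers, the same URL can be fetched with several header sets and the status codes compared. The following is only a minimal sketch under a few assumptions: it targets Python 3 and uses the standard library (urllib.request, the Python 3 successor to urllib2) so that it runs without extra dependencies; the names BROWSER_HEADERS, fetch_status and compare_header_sets are made up for illustration; and the Accept and Accept-Language values are guesses at what a browser might send, not something the sites in question are known to require.

```python
import urllib.error
import urllib.request

# Hypothetical browser-like headers (assumption: the exact values a given
# site expects are unknown; the User-Agent is the one from the question).
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def fetch_status(url, headers=None, method="GET", timeout=10):
    """Return the HTTP status code for url, including 4xx/5xx codes."""
    request = urllib.request.Request(url, headers=headers or {}, method=method)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises on 4xx/5xx; the code is still the server's answer.
        return error.code


def compare_header_sets(url, header_sets):
    """Map each label in header_sets to the status code the server returns."""
    return {label: fetch_status(url, headers) for label, headers in header_sets.items()}
```

Calling compare_header_sets(url, {"default": {}, "browser": BROWSER_HEADERS}) returns one status code per label. If the two codes differ, the server is probably filtering on headers; if they are the same, the cause likely lies elsewhere (cookies, login walls, IP-based blocking and so on).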

Jacky1205 answered Sep 27 '22 02:09


The code below will help you. It checks whether the site itself (scheme and host) responds with 200:

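from http import HTTPStatus
from urllib.parse import urlparse

import requests

def check_site_exist(url):
    """Return True if the site's root (scheme + host) answers HEAD with 200."""
    try:
        url_parts = urlparse(url)
        # Only the scheme and host are checked, not the full path.
        response = requests.head("://".join([url_parts.scheme, url_parts.netloc]))
        return response.status_code == HTTPStatus.OK
    except requests.RequestException:
        return False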
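
The function above only tests the site's root, while the question is about specific URLs. A variant that checks the full URL, and falls back from HEAD to GET when HEAD is refused, might look roughly like the sketch below. Treat it as an illustration and not a definitive validator: it assumes Python 3, uses only the standard library, the names DEFAULT_HEADERS, _status and url_exists are invented here, the User-Agent value is a placeholder, and counting any status below 400 as "exists" is a design choice you may want to change.

```python
import urllib.error
import urllib.request

# Hypothetical default header (assumption): some servers answer differently
# when no browser-like User-Agent is sent.
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; url-check-sketch)"}


def _status(url, method, headers, timeout):
    """Send one request; return the status code, or None if no HTTP answer came back."""
    try:
        request = urllib.request.Request(url, headers=headers, method=method)
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 4xx/5xx still count as an answer from the server
    except (urllib.error.URLError, OSError, ValueError):
        return None  # DNS failure, refused connection, timeout, malformed URL


def url_exists(url, headers=None, timeout=10):
    """Return True if the full URL answers with a status below 400.

    HEAD is tried first; if the server answers with an error status,
    GET is tried as a fallback, because some servers handle HEAD
    differently from GET.
    """
    headers = DEFAULT_HEADERS if headers is None else headers
    status = _status(url, "HEAD", headers, timeout)
    if status is None:
        return False
    if status >= 400:
        status = _status(url, "GET", headers, timeout)
    return status is not None and status < 400
```

Even with the fallback, a False result only means that this particular request was rejected. As the cases in the question show, a server can still serve the same page to a real browser.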

HaTiMSuM answered Sep 27 '22 02:09