
Python-requests: Check if URL is not HTML webpage

So I have a crawler that uses something like this:

#if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text

This works pretty well. However, if a media file such as .mp4 or .m4a gets into the crawler instead of an HTML page, the script hangs, and when I run it on Linux it just prints:

Killed

Is there a more efficient way to catch these non-HTML pages?

asked Aug 19 '14 20:08

User


1 Answer

You can send a HEAD request and check the Content-Type header. Proceed with the GET request only if it is text/html.

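import requests

# HEAD fetches only the headers; requests.head does not follow redirects by default.
r = requests.head(url, allow_redirects=True, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url, timeout=3).text
else:
    print("non html page")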
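
The substring test on the Content-Type header can be made a little stricter by comparing only the media type and ignoring parameters such as the charset. A minimal sketch, assuming you also want to accept XHTML; the helper name `is_html` is hypothetical and not part of requests:

```python
def is_html(content_type):
    """Return True if a Content-Type header value names an HTML document."""
    if not content_type:
        # Missing or empty header: treat as non-HTML rather than guessing.
        return False
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in ("text/html", "application/xhtml+xml")
```

It could be called as `is_html(r.headers.get("content-type"))`, which also avoids a KeyError when the server sends no Content-Type header at all.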

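If you want to make just a single request, pass stream=True so that the body is not downloaded until you ask for it:

r = requests.get(url, stream=True, timeout=3)
if "text/html" in r.headers.get("content-type", ""):
    html = r.text  # the body is downloaded here
else:
    r.close()  # release the connection without downloading the body
    print("non html page")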
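
Checking the header alone does not bound how much data arrives if a server labels a large file as text/html. A minimal sketch of a size cap, assuming the response was opened with requests' stream=True option so that it can be read in chunks via iter_content(); the names `read_capped` and `max_bytes` and the 2 MB default are hypothetical choices, not anything from the question:

```python
def read_capped(response, max_bytes=2_000_000, chunk_size=8192):
    """Read a streamed response body, giving up once it exceeds max_bytes.

    `response` is assumed to behave like a requests.Response opened with
    stream=True, i.e. it provides iter_content() and close().
    Returns the body as bytes, or None if the body was too large.
    """
    chunks = []
    total = 0
    try:
        for chunk in response.iter_content(chunk_size=chunk_size):
            total += len(chunk)
            if total > max_bytes:
                return None  # too large to be a page worth parsing
            chunks.append(chunk)
    finally:
        # Always release the connection, even when bailing out early.
        response.close()
    return b"".join(chunks)
```

With a response obtained from `requests.get(url, stream=True, timeout=3)`, a None result would mean the download was abandoned partway instead of being read fully into memory.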
answered Nov 02 '22 23:11

Ankush Shah