So I have a crawler that uses something like this:
# if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text
This works pretty well. But if a file like .mp4 or .m4a gets into the crawler instead of an HTML page, the script hangs, and on Linux the process eventually dies and prints only:
Killed
Is there a more efficient way to catch these non-HTML pages?
You can send a HEAD request and check the content type. Proceed only if it's text/html:
r = requests.head(url)
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url).text
else:
    print("non html page")
If you want to make only a single request, then:
r = requests.get(url)
if "text/html" in r.headers.get("content-type", ""):
    html = r.text
else:
    print("non html page")
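Note that the single-request version above still downloads the whole body before the content-type check, so a large .mp4 can still exhaust memory (the likely cause of the "Killed" message, which is the Linux OOM killer). One way around that is `stream=True`, which fetches only the headers up front; the body is read only if you access `r.text`. A minimal sketch (the `fetch_html` helper name is just for illustration):

```python
import requests

def fetch_html(url, timeout=3):
    """Return the page text if the server says it is HTML, else None.

    stream=True defers downloading the body, so a large non-HTML file
    is abandoned after the headers arrive instead of being read into
    memory.
    """
    r = requests.get(url, timeout=timeout, stream=True)
    if "text/html" not in r.headers.get("content-type", ""):
        r.close()  # drop the connection without reading the body
        return None
    return r.text  # body is downloaded only here
```

Compared to the HEAD-then-GET approach, this makes one request instead of two, and it also works with servers that reject or mishandle HEAD.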