So I have a crawler that uses something like this:
# if ".mp3" in baseUrl[0] or ".pdf" in baseUrl[0]:
if baseUrl[0][-4] == "." and ".htm" not in baseUrl[0]:
    raise Exception
html = requests.get(baseUrl[0], timeout=3).text
This works pretty well. But if a file like .mp4 or .m4a gets into the crawler instead of an HTML page, the script hangs, and on Linux the process eventually dies and prints only:
Killed
Is there a more efficient way to catch these non-HTML pages?
You can send a HEAD request and check the content type. Proceed only if it's text/html:
r = requests.head(url)
if "text/html" in r.headers.get("content-type", ""):
    html = requests.get(url).text
else:
    print("non html page")
If you want to make only a single request, then:
r = requests.get(url)
if "text/html" in r.headers.get("content-type", ""):
    html = r.text
else:
    print("non html page")
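Note that the single-request version above still downloads the whole body before the content-type check, so a large .mp4 can still exhaust memory (the likely cause of the "Killed" message, which is the Linux OOM killer). One way around that is `stream=True`, which fetches only the headers up front; the body is read only if you access `r.text`. A minimal sketch (the `fetch_html` helper name is just for illustration):

```python
import requests

def fetch_html(url, timeout=3):
    """Return the page text if the server says it is HTML, else None.

    stream=True defers downloading the body, so a large non-HTML file
    is abandoned after the headers arrive instead of being read into
    memory.
    """
    r = requests.get(url, timeout=timeout, stream=True)
    if "text/html" not in r.headers.get("content-type", ""):
        r.close()  # drop the connection without reading the body
        return None
    return r.text  # body is downloaded only here
```

Compared to the HEAD-then-GET approach, this makes one request instead of two, and it also works with servers that reject or mishandle HEAD.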