How to check if a page is an HTML page in Python?

Question

I am trying to write a web crawler in Python. I want to check whether the page I am about to crawl is an HTML page and not a file such as .pdf, .doc, or .docx. I do not want to check for the .html extension, because .asp and .aspx pages, or pages like http://bing.com/travel/, have no explicit .html extension but are still HTML pages. Is there a good way to do this in Python?

unutbu · Accepted Answer

This sends a HEAD request, so only the headers are retrieved from the server (Python 2, urllib2):

import urllib2
url = 'http://www.kernel.org/pub/linux/kernel/v3.0/testing/linux-3.7-rc6.tar.bz2'
req = urllib2.Request(url)
req.get_method = lambda: 'HEAD'
response = urllib2.urlopen(req)
content_type = response.headers.getheader('Content-Type')
print(content_type)

prints

application/x-bzip2

From this you can conclude that the content is not HTML. You could use

'html' in content_type

to test programmatically whether the content is HTML (or possibly XHTML). To be even more certain that the content is HTML, you could download it and try to parse it with an HTML parser such as lxml or BeautifulSoup.
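For readers on Python 3, where urllib2 no longer exists, the same idea might look roughly like the sketch below. The helper names is_html_content_type and head_content_type are hypothetical, and treating only text/html and application/xhtml+xml as HTML is an assumption that is stricter than the substring test above. It also handles a missing Content-Type header, which would make 'html' in content_type raise a TypeError.

```python
import urllib.request


def is_html_content_type(content_type):
    """Return True if a Content-Type header value names an HTML media type."""
    if not content_type:
        return False  # header missing: cannot tell from the headers alone
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type.split(';', 1)[0].strip().lower()
    # Assumption: only these two media types count as HTML.
    return media_type in ('text/html', 'application/xhtml+xml')


def head_content_type(url, timeout=10):
    """Send a HEAD request and return the Content-Type header (or None)."""
    req = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return response.headers.get('Content-Type')
```

A crawler could then call is_html_content_type(head_content_type(url)) before deciding whether to fetch the body.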

Beware of using requests.get like this:

import requests
r = requests.get(url)
print(r.headers['content-type'])

This takes a long time, and my network monitor shows a sustained load, which leads me to believe that it downloads the entire file and not just the headers.

On the other hand,

import requests
r = requests.head(url)
print(r.headers['content-type'])

gets the headers only.
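Two details may matter in a real crawler: requests.head does not follow redirects unless allow_redirects=True is passed, and some servers reject HEAD requests. The following is a minimal sketch under those assumptions, not a definitive implementation. The function name fetch_content_type and the fallback policy (retry with a streamed GET on any 4xx/5xx status) are hypothetical choices; stream=True makes requests read the headers while deferring the body download.

```python
def fetch_content_type(url, session=None, timeout=10):
    """Return the Content-Type header for url without downloading the body."""
    if session is None:
        import requests  # third-party; imported lazily so a stub can be passed in
        session = requests.Session()
    # HEAD requests in requests do not follow redirects unless asked to.
    response = session.head(url, allow_redirects=True, timeout=timeout)
    if response.status_code >= 400:
        # Assumption: the server may not support HEAD, so retry with a
        # streamed GET, which fetches the headers but defers the body.
        response = session.get(url, stream=True, timeout=timeout)
        response.close()  # release the connection without reading the body
    return response.headers.get('content-type')
```

Passing a shared requests.Session as the session argument lets a crawler reuse connections across many URLs.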

Tags:

python

user2793286
