How to check whether a given link (URL) points to a file or to another webpage?
Currently I am doing it with a quite hacky, multi-step check, which also requires converting relative links to absolute ones, adding the http prefix if it is missing, and stripping '#' anchors/params before it works. I am also not sure whether I'm whitelisting all page extensions that exist.
import re

def check_file(url):
    try:
        # first path segment after the domain, e.g. 'page.php' in 'https://host/page.php'
        sub_domain = re.split(r'/+', url)[2]
    except IndexError:
        return False  # nothing after the domain -> main page, no file
    if not re.search(r'\.', sub_domain):
        return False  # no dot, no file
    if re.search(r'\.html?$|\.php$|\.asp$', sub_domain):
        return False  # whitelist some page extensions
    return True
tests = [
    'https://www.stackoverflow.com',
    'https://www.stackoverflow.com/randomlink',
    'https:////www.stackoverflow.com//page.php',
    'https://www.stackoverflow.com/page.html',
    'https://www.stackoverflow.com/page.htm',
    'https://www.stackoverflow.com/file.exe',
    'https://www.stackoverflow.com/image.png',
]

for test in tests:
    print(test + '\n' + str(check_file(test)))
# False: https://www.stackoverflow.com
# False: https://www.stackoverflow.com/randomlink
# False: https:////www.stackoverflow.com//page.php
# False: https://www.stackoverflow.com/page.html
# False: https://www.stackoverflow.com/page.htm
# True: https://www.stackoverflow.com/file.exe
# True: https://www.stackoverflow.com/image.png
Is there a clean, single-regex solution to this problem, or a library with an established function for it? I guess someone must have faced this problem before me, but unfortunately I couldn't find a solution here on SO or elsewhere.
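For reference, the closest thing I've found to an established function is the standard library's mimetypes.guess_type, which guesses a MIME type from the extension alone; combined with urllib.parse it avoids the manual prefix/anchor handling. The looks_like_file helper below is my own sketch (the name is made up), and its results depend on the platform's MIME table:

```python
import mimetypes
from urllib.parse import urlparse

def looks_like_file(url):
    """Guess from the URL path alone whether it points to a non-HTML file."""
    path = urlparse(url).path  # drops the query string and '#' fragment
    mime, _encoding = mimetypes.guess_type(path)
    # No guessable type (bare domain, extensionless path) or an HTML-ish
    # type both count as "webpage"; anything else counts as "file".
    return mime is not None and mime not in ('text/html', 'application/xhtml+xml')

print(looks_like_file('https://www.stackoverflow.com/image.png'))   # True
print(looks_like_file('https://www.stackoverflow.com/page.html'))   # False
print(looks_like_file('https://www.stackoverflow.com/randomlink'))  # False
```

Note that this still cannot tell that a '.php' URL is a page unless the local MIME table happens to omit that extension, so it is only as reliable as an extension-based guess can be.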
Aran-Fey's answer works well on well-behaved pages, which make up 99.99% of the web. But there's no rule that says a URL ending with a particular extension must resolve to content of a particular type. A poorly configured server could return HTML for a request for a page named "example.png", or an MPEG for a page named "example.php", or any other combination of content type and file extension.
The most accurate way to get content-type information for a URL is to actually visit that URL and examine the Content-Type header of the response. Most HTTP libraries have a way to retrieve only the headers from a site, so this operation should be relatively quick even for very large pages. For example, if you were using requests, you might do:
import requests

def get_content_type(url):
    response = requests.head(url)
    return response.headers['Content-Type']

test_cases = [
    "http://www.example.com",
    "https://i.stack.imgur.com/T3HH6.png?s=328&g=1",
    "http://php.net/manual/en/security.hiding.php",
]

for url in test_cases:
    print("Url:", url)
    print("Content type:", get_content_type(url))
Result:
Url: http://www.example.com
Content type: text/html; charset=UTF-8
Url: https://i.stack.imgur.com/T3HH6.png?s=328&g=1
Content type: image/png
Url: http://php.net/manual/en/security.hiding.php
Content type: text/html; charset=utf-8
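Note that the values above include parameters such as charset; if you want to compare the result against a bare MIME type like 'text/html', parse the header first. A small standard-library sketch (parse_content_type is a name I made up, not an established API):

```python
from email.message import Message

def parse_content_type(header):
    """Split a Content-Type header into (mime_type, charset)."""
    msg = Message()
    msg['Content-Type'] = header
    # get_content_type() returns the lowercased 'maintype/subtype' part;
    # get_param() extracts a single parameter, or None if absent.
    return msg.get_content_type(), msg.get_param('charset')

print(parse_content_type('text/html; charset=UTF-8'))  # ('text/html', 'UTF-8')
print(parse_content_type('image/png'))                 # ('image/png', None)
```

With that, deciding "page or file" becomes a comparison of the parsed MIME type rather than a substring check on the raw header.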