
Regex check if link is to a file

How to check if a given link (url) is to file or another webpage?

I mean:

  • page: https://stackoverflow.com/questions/
  • page: https://www.w3schools.com/html/default.asp
  • file: https://www.python.org/ftp/python/3.7.2/python-3.7.2.exe
  • file: http://jmlr.org/papers/volume19/16-534/16-534.pdf#page=15

Currently I do this with a rather hacky multi-step check. To work, it also requires converting relative links to absolute ones, adding the http prefix if it is missing, and stripping '#' anchors and parameters. I am also not sure whether I am whitelisting every page extension that exists.

import re

def check_file(url):
    try:
        # first path segment: the part after the second run of slashes
        first_segment = re.split(r'/+', url)[2]
    except IndexError:
        return False # nothing after the host = main page, no file
    if '.' not in first_segment:
        return False # no dot, no file
    if re.search(r'\.html?$|\.php$|\.asp$', first_segment):
        return False # whitelist some page extensions
    return True

tests = [
    'https://www.stackoverflow.com',
    'https://www.stackoverflow.com/randomlink',
    'https:////www.stackoverflow.com//page.php',
    'https://www.stackoverflow.com/page.html',
    'https://www.stackoverflow.com/page.htm',
    'https://www.stackoverflow.com/file.exe',
    'https://www.stackoverflow.com/image.png'
]

for test in tests:
    print(str(check_file(test)) + ': ' + test)
# False: https://www.stackoverflow.com
# False: https://www.stackoverflow.com/randomlink
# False: https:////www.stackoverflow.com//page.php
# False: https://www.stackoverflow.com/page.html
# False: https://www.stackoverflow.com/page.htm
# True: https://www.stackoverflow.com/file.exe
# True: https://www.stackoverflow.com/image.png

Is there a clean, single-regex solution to this problem, or a library with an established function for it? I guess someone must have faced this problem before me, but unfortunately I couldn't find a solution here on SO or elsewhere.

pieca asked Mar 07 '19 13:03


1 Answer

Aran-Fey's answer works well on well-behaved pages, which make up the vast majority of the web. But no rule says that a URL ending with a particular extension must resolve to content of a particular type. A poorly configured server could return HTML for a request to a page named "example.png", or an MPEG for a page named "example.php", or any other combination of content type and file extension.
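For comparison, an extension-based check of the kind discussed above can be sketched with the standard library alone. This is only an illustration, not necessarily how the referenced answer does it: the function name looks_like_file and the PAGE_TYPES set are assumptions made for this example, and mimetypes.guess_type relies on the MIME table available on the machine, so results for unusual extensions may vary.

```python
import mimetypes
import posixpath
from urllib.parse import urlsplit

# Assumption: these media types count as "pages"; any other known type is a "file".
PAGE_TYPES = {"text/html", "application/xhtml+xml"}

def looks_like_file(url):
    """Guess from the URL path alone whether the link points to a file."""
    path = urlsplit(url).path        # drops the query string and '#fragment'
    name = posixpath.basename(path)  # last path segment, '' after a trailing slash
    guessed_type, _encoding = mimetypes.guess_type(name)
    if guessed_type is None:
        return False                 # no recognizable extension: treat as a page
    return guessed_type not in PAGE_TYPES
```

Because it never contacts the server, this sketch is fast but shares the weakness described above: it trusts the extension.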

The most accurate way to get content type information for a URL is to actually request that URL and examine the Content-Type field in the response headers. Most HTTP libraries can send a HEAD request, which retrieves only the headers and not the body, so this operation should be relatively quick even for very large pages. For example, if you were using requests, you might do:

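import requests

def get_content_type(url):
    # HEAD fetches only the headers; requests does not follow redirects for HEAD by default
    response = requests.head(url, allow_redirects=True, timeout=10)
    # the header may be missing, so use .get() (returns None) instead of indexing
    return response.headers.get('Content-Type')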

test_cases = [
    "http://www.example.com",
    "https://i.stack.imgur.com/T3HH6.png?s=328&g=1",
    "http://php.net/manual/en/security.hiding.php",
]

for url in test_cases:
    print("Url:", url)
    print("Content type:", get_content_type(url))

Result:

Url: http://www.example.com
Content type: text/html; charset=UTF-8
Url: https://i.stack.imgur.com/T3HH6.png?s=328&g=1
Content type: image/png
Url: http://php.net/manual/en/security.hiding.php
Content type: text/html; charset=utf-8
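
To turn a header value like the ones above into a page-or-file decision, the media type has to be separated from parameters such as the charset. The following is a minimal sketch, not an established API: the name is_page and the PAGE_MEDIA_TYPES set are assumptions made for this example, and which types count as "pages" is a judgment call.

```python
# Assumption: only these media types count as "pages"; adjust as needed.
PAGE_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}

def is_page(content_type):
    """Return True if a Content-Type header value describes an HTML page."""
    if not content_type:
        return False  # header missing or empty: cannot confirm it is a page
    # "text/html; charset=UTF-8" -> "text/html"
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in PAGE_MEDIA_TYPES

for header in ["text/html; charset=UTF-8", "image/png"]:
    print(header, "->", "page" if is_page(header) else "file")
```

The function takes the header string rather than a URL, so it can be combined with whatever HTTP library fetched the headers.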
Kevin answered Sep 25 '22 05:09