
URL validation regex for real-world URLs

I want to validate that given strings are URLs. Matching URLs in text would be nice too, but not required. I've searched and experimented but so far I did not find something that answers these requirements:

  1. Must not accept strings which, when treated as links, pose a security risk. For example, <a href="javascript:alert(document.cookie)">clickme</a> is a valid HTML element and indeed works (raises an alert and so on) in at least some browsers. I'm concerned that if I allow arbitrary schemes (see below) it can compromise security (as noted, for example, here: What is the best regular expression to check if a string is a valid URL?).

  2. Must work correctly in JavaScript.

  3. Would be nice if it worked the same in Java -- I'm developing in GWT, so this would be nice but not strictly necessary.

  4. Must accept URLs which are used in practice, and not only standard-compliant URLs. Specific examples:

    a. I want to accept http://fr.wikipedia.org/wiki/Français, which is non-standard because of the non-English character, but accepted by my reference browsers IE(7+) and Chrome.

    b. I want to accept http://fr.wikipedia.org/wiki/Fran%c3%a7ais, which is non-standard because percent-encoding hex should be uppercase, but again accepted by IE and Chrome. I guess I could just do a case-insensitive match -- any downside you can think of?

    c. I want to accept http://localhost/localpath/servlet#action?param=value, which is non-standard because the fragment part (from '#' to the end) should not include '?' and other chars, but there are apps which generate such URLs and browsers accept them.

    d. I want to accept URLs with any scheme/protocol (not just http, https and ftp), because all kinds of apps I integrate with and their users may need to pass such URLs. I can forbid 'javascript:' and allow everything else; if you think this would compromise security please say so.

There is a ton of questions on this topic in SO and elsewhere, but I did not find a regex which answers all of my requirements. Examples:

  • Regex in GWT to match URLs -- Pretty good and simple regex, but doesn't accept non-standard URLs. I can handle the scheme part and the percent-encoding case-sensitivity, but not the other issues.

  • https://stackoverflow.com/a/190405/96929 -- Giant regex (I ask myself if all browsers and frameworks I use can handle this size) which appears to be very comprehensive, but says it conforms to standard and I can't make heads or tails of it.

Thanks! :-)

Oren Shalev asked Jan 15 '12 11:01



2 Answers

Must accept URLs which are used in practice, and not only standard-compliant URLs

Actually the URI spec is pretty liberal and permits constructs which generally you want to exclude for compatibility reasons...

I want to accept http://fr.wikipedia.org/wiki/Français, which is non-standard

It's not a URI, but it is a quite standard IRI.

  • non-standard because percent-encoding hex should be uppercase
  • non-standard because the fragment part (from '#' to the end) should not include '?'

Both of these are perfectly acceptable according to the URI standard. RFC 3986 recommends but does not require that upper-case be used when creating percent-encodings.

I can forbid 'javascript:' and allow everything else; if you think this would compromise security please say so.

It would. Unfortunately there have been multiple potentially dangerous additions to the URI scheme namespace, and there will doubtless continue to be in future. Plus there are potentially ways that blacklisting might be evaded using encoded characters and control characters.

In addition, arbitrary-scheme matching means your secondary goal of detecting addresses in text will create a false positive most times a colon is used.

Whitelisting is the only plausible way forward, so you will just have to manually permit each new scheme on a case-by-case basis. This requires some care; for example the data: scheme seems innocuous and useful, but potentially suffers from the same XSS issues as javascript:.
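
As a rough illustration of the whitelisting idea, here is a minimal JavaScript sketch. The names ALLOWED_SCHEMES and hasAllowedScheme and the particular list of schemes are assumptions made for the example; the scheme grammar (a letter followed by letters, digits, "+", "-" or ".") is the one from RFC 3986.

```javascript
// Hypothetical allow-list; extend it one scheme at a time after reviewing each.
const ALLOWED_SCHEMES = new Set(['http', 'https', 'ftp', 'mailto']);

// RFC 3986 scheme grammar: ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":".
const SCHEME_RE = /^([a-z][a-z0-9+.-]*):/i;

function hasAllowedScheme(input) {
  const match = SCHEME_RE.exec(input);
  // Nothing is trimmed or stripped first: anything that does not start with a
  // well-formed, listed scheme is rejected outright.
  return match !== null && ALLOWED_SCHEMES.has(match[1].toLowerCase());
}

console.log(hasAllowedScheme('https://example.com/'));  // listed scheme
console.log(hasAllowedScheme('JavaScript:alert(1)'));   // not listed
console.log(hasAllowedScheme('data:text/html,hello'));  // not listed
```

Because the check only accepts what is listed, a newly invented dangerous scheme, or one disguised with leading whitespace or embedded control characters, fails by default instead of needing its own rule.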

You will also need to know some information about each scheme. Schemes like http and ftp have a ‘server-based naming authority’: they can include a separate hostname and resource path within that host; additionally you probably require them to be absolute URIs. If you want to allow file URIs, you'd have to check that it was hostless (file:///). For other schemes there may be no concrete syntax required by the URI standard itself, but there might be other restrictions, for example mailto: must take a valid e-mail address.

Giant regex (I ask myself if all browsers and frameworks I use can handle this size) which appears to be very comprehensive

This won't work in JavaScript because it has the unsupported \x{code point} syntax. Also languages like JavaScript whose regex engines work in terms of UTF-16 code units instead of full Unicode code points won't be able to handle character ranges outside the BMP.

You'd have to replace the long \x{A0}...\x{1FFFD} groups with something simpler like \u00A0-\uFFFD, and then check for invalid surrogate pairs separately, as well as the 0xnnFFFE–F non-characters, if you care about those (probably not).

Arguably you would probably already have excised any bad surrogates and non-characters at a general input scanning level before you get as far as IRI validation; there is no reason to allow them in any textual input. Doing that in a separate step makes more sense than trying to shoehorn everything into a single regex.
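
A sketch of what such a separate scanning step might look like in JavaScript, working on UTF-16 code units. The function name isCleanText is made up for the example, and it only covers the two cases mentioned above: unpaired surrogates and the U+nnFFFE/U+nnFFFF non-characters.

```javascript
// Returns true if the string has no unpaired surrogates and no
// U+nnFFFE / U+nnFFFF non-characters (on any plane).
function isCleanText(input) {
  for (let i = 0; i < input.length; i++) {
    const unit = input.charCodeAt(i);
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = input.charCodeAt(i + 1); // NaN at the end of the string
      if (!(next >= 0xDC00 && next <= 0xDFFF)) return false;
      // Valid pair: check the full code point for the non-character pattern.
      if ((input.codePointAt(i) & 0xFFFE) === 0xFFFE) return false;
      i++; // skip the low surrogate we just consumed
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return false; // low surrogate with no preceding high surrogate
    } else if ((unit & 0xFFFE) === 0xFFFE) {
      return false; // U+FFFE or U+FFFF
    }
  }
  return true;
}
```

Running this once on incoming text means the URL regex itself can use the simple \u00A0-\uFFFD range without worrying about what the surrogate code units inside that range actually encode.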

With that replaced, the longest part of the quoted regex is the insanely long string of digit-checking trying to validate numeric IP addresses. This is the kind of thing regex isn't good at at all. I would strongly consider not bothering with the IPv6 and IPv-future numeric addresses: even assuming widespread IPv6 adoption soon, no-one will be putting literal numeric addresses of that kind in links for the foreseeable future. (Do you even want to permit links to numeric addresses? Depends on what your app is doing, but often not.)

You might also consider disallowing userinfo@ hostname prefixes (since they have traditionally been of no use except for spoofing attacks), and percent-encoded hostnames (since they serve no purpose given the existence of Punycode, and don't work in some browsers).
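
If you do go that way, the check can be done outside the main regex. A minimal JavaScript sketch follows; getAuthority and hasPlainHost are hypothetical helper names, and the sketch assumes the scheme has already been validated separately.

```javascript
// Extracts the authority: the part between "//" and the next "/", "?" or "#".
// Returns null for strings with no "scheme://" prefix (e.g. mailto: URIs).
function getAuthority(url) {
  const match = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#]*)/i.exec(url);
  return match ? match[1] : null;
}

// True only if there is an authority and it has no userinfo@ prefix
// and no percent-encoding in the host part.
function hasPlainHost(url) {
  const authority = getAuthority(url);
  if (authority === null) return false;
  return !authority.includes('@') && !authority.includes('%');
}

console.log(hasPlainHost('http://example.com/a@b?x=%20'));          // "@" and "%" after the host are fine
console.log(hasPlainHost('http://www.bank.example@evil.example/')); // userinfo prefix
console.log(hasPlainHost('http://ex%61mple.com/'));                 // percent-encoded host
```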

So there is not one single answer to IRI validation, but here's the sort of place you might start:

(
    https?://
    (
        ([0-9]{1,3}(\.[0-9]{1,3}){3})|
        ([-0-9a-z\u00A0-\uFFFD]{1,63}(\.[-0-9a-z\u00A0-\uFFFD]{1,63})*)
    )
    (:[0-9]+)?/
    (
        %[0-9a-f][0-9a-f]|
        [-._!$&'()*+,:;=@~0-9a-z\u00A0-\uFFFD/?#]
    )*
)|(
    ftp://                                    // same again but with no ?query
    ...                                       // or port number
)|(
    mailto:                                   // specify requirements for
    ...                                       // other accepted schemes
)

(Case insensitivity assumed. This applies DNS constraints that are not part of the URI spec itself, though incompletely as it doesn't check for leading/trailing - in DNS labels, or the number range in IPv4 octets. Validating e-mail addresses is left as an exercise for the reader, as it is in itself an arduous task unsuited to regex if you want to do it rigorously.)
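
JavaScript's RegExp has no free-spacing mode, so the multi-line layout above cannot be pasted in as is. One way to use the http/https branch is to assemble it from strings, with the quantifiers written as {1,3} and {1,63}, and to do the two checks the regex skips (IPv4 octet range, hyphens at label edges) in plain code. This is only a sketch of that branch: the names LABEL, IPV4, PATH_CHAR, HTTP_URL and isValidHttpUrl are invented for the example, and the ftp and mailto branches are left out.

```javascript
// Host label: ASCII letters, digits, "-", plus the simplified non-ASCII range.
const LABEL = '[-0-9a-z\\u00A0-\\uFFFD]{1,63}';
const IPV4 = '[0-9]{1,3}(?:\\.[0-9]{1,3}){3}';
// A percent-encoded byte, or one permitted literal character.
const PATH_CHAR = "%[0-9a-f][0-9a-f]|[-._!$&'()*+,:;=@~0-9a-z\\u00A0-\\uFFFD/?#]";

const HTTP_URL = new RegExp(
  '^https?://' +
  '((?:' + IPV4 + ')|(?:' + LABEL + '(?:\\.' + LABEL + ')*))' + // group 1: host
  '(?::[0-9]+)?/' +          // optional port, then a mandatory "/"
  '(?:' + PATH_CHAR + ')*$',
  'i'                        // case-insensitive, as the answer assumes
);

function isValidHttpUrl(input) {
  const match = HTTP_URL.exec(input);
  if (match === null) return false;
  const host = match[1];
  if (/^[0-9.]+$/.test(host)) {
    // All-numeric host: require exactly four octets, each in 0..255.
    const octets = host.split('.');
    return octets.length === 4 && octets.every((o) => Number(o) <= 255);
  }
  // DNS labels must not start or end with a hyphen.
  return host.split('.').every((l) => !l.startsWith('-') && !l.endsWith('-'));
}

console.log(isValidHttpUrl('http://fr.wikipedia.org/wiki/Fran%c3%a7ais'));
console.log(isValidHttpUrl('http://localhost/localpath/servlet#action?param=value'));
console.log(isValidHttpUrl('javascript:alert(document.cookie)'));
```

Like the pattern it is built from, this requires a "/" after the host, so http://example.com with no trailing slash is rejected; loosen that if your inputs need it.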

bobince answered Oct 16 '22 02:10


Since you are using Java on the server side, I'd suggest you use java.net.URI. It will accept all the "bizarre" stuff you want, and it is then just a matter of calling .getScheme() to check that the scheme is indeed HTTP or HTTPS.

And unlike URL, URI will not try to do name resolution!

fge answered Oct 16 '22 03:10