URL validation regex for real-world URLs

Tags: javascript, regex, url

I want to validate that given strings are URLs. Matching URLs in text would be nice too, but is not required. I've searched and experimented, but so far I have not found anything that meets these requirements:

1. Must not accept strings which, when treated as links, pose a security risk. For example, <a href="javascript:alert(document.cookie)">clickme</a> is a valid HTML element and indeed works (raises an alert and so on) in at least some browsers. I'm concerned that if I allow arbitrary schemes (see below) it can compromise security (as noted, for example, here: What is the best regular expression to check if a string is a valid URL?).
2. Must work correctly in JavaScript.
3. Would be nice if it worked the same in Java -- I'm developing in GWT, so this would be nice but not strictly necessary.
4. Must accept URLs which are used in practice, and not only standard-compliant URLs. Specific examples:

a. I want to accept http://fr.wikipedia.org/wiki/Français, which is non-standard because of the non-English character, but accepted by my reference browsers IE(7+) and Chrome.

b. I want to accept http://fr.wikipedia.org/wiki/Fran%c3%a7ais, which is non-standard because percent-encoding hex should be uppercase, but again accepted by IE and Chrome. I guess I could just do a case-insensitive match -- any downside you can think of?

c. I want to accept http://localhost/localpath/servlet#action?param=value, which is non-standard because the fragment part (from '#' to the end) should not include '?' and other chars, but there are apps which generate such URLs and browsers accept them.

d. I want to accept URLs with any scheme/protocol (not just http, https and ftp), because all kinds of apps I integrate with and their users may need to pass such URLs. I can forbid 'javascript:' and allow everything else; if you think this would compromise security please say so.

There are a ton of questions on this topic on SO and elsewhere, but I did not find a regex which meets all of my requirements. Examples:

Regex in GWT to match URLs -- Pretty good and simple regex, but doesn't accept non-standard URLs. I can handle the scheme part and the percent-encoding case-sensitivity, but not the other issues.
https://stackoverflow.com/a/190405/96929 -- Giant regex (I ask myself if all browsers and frameworks I use can handle this size) which appears to be very comprehensive, but says it conforms to standard and I can't make heads or tails of it.

Thanks! :-)

asked Jan 15 '12 11:01

Oren Shalev

2 Answers

> Must accept URLs which are used in practice, and not only standard-compliant URLs

Actually the URI spec is pretty liberal and permits constructs which generally you want to exclude for compatibility reasons...

> I want to accept http://fr.wikipedia.org/wiki/Français, which is non-standard

It's not a URI, but it is a quite standard IRI.

> non-standard because percent-encoding hex should be uppercase
>
> non-standard because the fragment part (from '#' to the end) should not include '?'

Both of these are perfectly acceptable according to the URI standard. RFC 3986 recommends but does not require that upper-case be used when creating percent-encodings.

> I can forbid 'javascript:' and allow everything else; if you think this would compromise security please say so.

It would. Unfortunately there have been multiple potentially-dangerous additions to the URI scheme namespace, and there will doubtless continue to be in future. Plus there may be ways that blacklisting can be evaded using encoded characters and control characters.
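To illustrate the evasion problem, here is a minimal sketch. The function name `naiveBlacklistAllows` is made up for this example, and it assumes the parsing behaviour described in the WHATWG URL Standard, where leading spaces and control characters are stripped and embedded tabs and newlines are removed before the scheme is read. Actual behaviour may vary between browsers and versions.

```javascript
// Hypothetical naive check: reject only strings that literally start
// with "javascript:" (case-insensitively).
function naiveBlacklistAllows(url) {
  return !/^javascript:/i.test(url);
}

// Assumption: a URL parser following the WHATWG rules would still read
// each of these as a javascript: URL.
const evasions = [
  " javascript:alert(1)",      // leading space
  "java\tscript:alert(1)",     // embedded tab
  "\u0001javascript:alert(1)", // leading control character
];

for (const s of evasions) {
  // Each of these slips past the blacklist.
  console.log(JSON.stringify(s), "allowed:", naiveBlacklistAllows(s));
}
```

The blacklist would need to anticipate every such normalisation step, which is why it is fragile.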

In addition, arbitrary-scheme matching means your secondary goal of detecting addresses in text will create a false positive most times a colon is used.
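A quick way to see the false positives is to run an "any scheme" matcher over ordinary text. The pattern `ANY_SCHEME` and the sample sentence below are invented for illustration only.

```javascript
// Hypothetical "any scheme" matcher: a scheme per RFC 3986 (letter, then
// letters, digits, "+", "." or "-"), a colon, then any run of non-spaces.
const ANY_SCHEME = /\b[a-z][a-z0-9+.-]*:\S+/gi;

const text = "Run it on localhost:8080 then save to C:\\temp and open http://example.com/";

// Collect everything the matcher thinks is a URL.
const matches = text.match(ANY_SCHEME);
console.log(matches);
```

Here the host:port pair and the Windows path are both reported alongside the one real URL.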

Whitelisting is the only plausible way forward, so you will just have to manually permit each new scheme on a case-by-case basis. This requires some care; for example the data: scheme seems innocuous and useful, but potentially suffers from the same XSS issues as javascript:.
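A whitelist check can be quite small. This is only a sketch: the list `ALLOWED_SCHEMES` and the function `allowedScheme` are hypothetical names, and which schemes belong on the list depends on your application.

```javascript
// Hypothetical whitelist; extend it one scheme at a time after review.
const ALLOWED_SCHEMES = ["http", "https", "ftp", "mailto"];

// Returns the lower-cased scheme if the string starts with a whitelisted
// scheme followed by ":", otherwise null. The pattern is anchored at the
// start, so leading spaces or control characters never match.
function allowedScheme(url) {
  const m = /^([a-z][a-z0-9+.-]*):/i.exec(url);
  if (!m) return null;
  const scheme = m[1].toLowerCase();
  return ALLOWED_SCHEMES.indexOf(scheme) !== -1 ? scheme : null;
}

console.log(allowedScheme("HTTP://example.com/"));           // whitelisted
console.log(allowedScheme("data:text/html,<b>hi</b>"));      // not whitelisted
console.log(allowedScheme("java\tscript:alert(1)"));         // no valid scheme
```

Because anything not explicitly listed is rejected, new or obscure schemes fail closed instead of slipping through.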

You will also need to know some information about each scheme. Schemes like http and ftp have a ‘server-based naming authority’: they can include a separate hostname and resource path within that host; additionally you probably require them to be absolute URIs. If you want to allow file URIs, you'd have to check that it was hostless (file:///). For other schemes there may be no concrete syntax required by the URI standard itself, but there might be other restrictions, for example mailto: must take a valid e-mail address.
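One way to express per-scheme requirements is a lookup table of structural checks. The table `SCHEME_RULES` and the function `passesSchemeRule` are hypothetical, and the patterns are deliberately loose: they only check the shape described above (an authority for server-based schemes, no host for file:), not full validity.

```javascript
// Hypothetical per-scheme structural checks.
const SCHEME_RULES = {
  // Server-based schemes: require "//" and a non-empty authority,
  // which also forces the URI to be absolute.
  http:  /^http:\/\/[^\/?#]+/i,
  https: /^https:\/\/[^\/?#]+/i,
  ftp:   /^ftp:\/\/[^\/?#]+/i,
  // file: must be hostless, i.e. start with "file:///".
  file:  /^file:\/\/\//i,
};

function passesSchemeRule(url) {
  const m = /^([a-z][a-z0-9+.-]*):/i.exec(url);
  if (!m) return false;
  const scheme = m[1].toLowerCase();
  // Own-property lookup, so a name like "constructor" is not taken for a rule.
  if (!Object.prototype.hasOwnProperty.call(SCHEME_RULES, scheme)) return false;
  return SCHEME_RULES[scheme].test(url);
}

console.log(passesSchemeRule("file:///etc/hosts"));    // hostless file URI
console.log(passesSchemeRule("file://server/share"));  // has a host
```

A scheme such as mailto: would need its own entry with whatever address check you decide is good enough.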

> Giant regex (I ask myself if all browsers and frameworks I use can handle this size) which appears to be very comprehensive

This won't work in JavaScript because it has the unsupported \x{code point} syntax. Also languages like JavaScript whose regex engines work in terms of UTF-16 code units instead of full Unicode code points won't be able to handle character ranges outside the BMP.

You'd have to replace the long \x{A0}...\x{1FFFD} groups with something simpler like \u00A0-\uFFFD, and then check for invalid surrogate pairs separately, as well as the 0xnnFFFE–F non-characters, if you care about those (probably not).

Arguably you would probably already have excised any bad surrogates and non-characters at a general input scanning level before you get as far as IRI validation; there is no reason to allow them in any textual input. Doing that in a separate step makes more sense than trying to shoehorn everything into a single regex.
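As a sketch of that separate step, the function below scans UTF-16 code units for unpaired surrogates and non-characters. The name `hasBadCodePoints` is hypothetical. Besides the U+nFFFE and U+nFFFF code points mentioned above, it also rejects U+FDD0 to U+FDEF, which Unicode likewise defines as non-characters; drop that line if you do not care about them.

```javascript
// Returns true if the string contains an unpaired surrogate or a Unicode
// non-character (U+FDD0..U+FDEF, or any code point ending in FFFE/FFFF).
function hasBadCodePoints(s) {
  for (let i = 0; i < s.length; i++) {
    const unit = s.charCodeAt(i);
    let cp = unit;
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: must be followed by a low surrogate.
      const next = i + 1 < s.length ? s.charCodeAt(i + 1) : 0;
      if (next < 0xDC00 || next > 0xDFFF) return true;
      cp = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
      i++; // skip the low surrogate that was just consumed
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      return true; // low surrogate without a preceding high surrogate
    }
    if ((cp & 0xFFFE) === 0xFFFE) return true;     // U+nFFFE and U+nFFFF
    if (cp >= 0xFDD0 && cp <= 0xFDEF) return true; // U+FDD0..U+FDEF
  }
  return false;
}

console.log(hasBadCodePoints("Fran\u00E7ais")); // ordinary text
console.log(hasBadCodePoints("abc\uD800def"));  // lone high surrogate
```

Running this once on all incoming text means the URL regex itself can stay with a simple `\u00A0-\uFFFD` range.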

With that replaced, the longest part of the quoted regex is the insanely long string of digit-checking trying to validate numeric IP addresses. This is the kind of thing regex is not good at. I would strongly consider not bothering with the IPv6 and IPv-future numeric addresses: even assuming widespread IPv6 adoption soon, no-one will be using them in links for the foreseeable future. (Do you even want to permit links to numeric addresses? Depends on what your app is doing, but often not.)
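If you do want numeric IPv4 hosts, one option is to let the regex check only the shape and do the range check in ordinary code. A minimal sketch, with the hypothetical function name `isIPv4Literal`:

```javascript
// Returns true for dotted-decimal IPv4 literals such as "192.168.0.1".
function isIPv4Literal(host) {
  const m = /^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$/.exec(host);
  if (!m) return false;
  // The regex only checks the shape; the numeric range is checked here.
  return m.slice(1).every(function (octet) {
    return parseInt(octet, 10) <= 255;
  });
}

console.log(isIPv4Literal("192.168.0.1")); // valid shape and range
console.log(isIPv4Literal("256.1.1.1"));   // octet out of range
```

This keeps the pattern short and readable compared with encoding 0 to 255 as alternations of digit classes.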

You might also consider disallowing userinfo@ hostname prefixes (since they have traditionally been of no use except for spoofing attacks), and percent-encoded hostnames (since they serve no purpose given the existence of Punycode, and don't work in some browsers).
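Both checks only need the authority part of the URL. The helper names `getAuthority` and `hasSuspiciousAuthority` below are hypothetical, and the sketch assumes the scheme has already been whitelisted.

```javascript
// Extracts the authority: the part between "//" and the next "/", "?" or "#".
// Returns null when the string has no "scheme://" prefix.
function getAuthority(url) {
  const m = /^[a-z][a-z0-9+.-]*:\/\/([^\/?#]*)/i.exec(url);
  return m ? m[1] : null;
}

function hasSuspiciousAuthority(url) {
  const authority = getAuthority(url);
  if (authority === null) return false;  // no authority to inspect
  return authority.indexOf("@") !== -1   // userinfo@ prefix
      || authority.indexOf("%") !== -1;  // percent-encoded hostname
}

console.log(hasSuspiciousAuthority("http://www.bank.example@evil.example/"));
console.log(hasSuspiciousAuthority("http://example.com/a@b%20c"));
```

Note that "@" and "%" remain perfectly fine in the path; only the authority is inspected.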

So there is not one single answer to IRI validation, but here's the sort of place you might start:

(
    https?://
    (
        ([0-9]{1,3}(\.[0-9]{1,3}){3})|
        ([-0-9a-z\u00A0-\uFFFD]{1,63}(\.[-0-9a-z\u00A0-\uFFFD]{1,63})*)
    )
    (:[0-9]+)?/
    (
        %[0-9a-f][0-9a-f]|
        [-._!$&'()*+,:;=@~0-9a-z\u00A0-\uFFFD/?#]
    )*
)|(
    ftp://                                    // same again but with no ?query
    ...                                       // or port number
)|(
    mailto:                                   // specify requirements for
    ...                                       // other accepted schemes
)

(Case insensitivity assumed. This applies DNS constraints that are not part of the URI spec itself, though incompletely as it doesn't check for leading/trailing - in DNS labels, or the number range in IPv4 octets. Validating e-mail addresses is left as an exercise for the reader, as it is in itself an arduous task unsuited to regex if you want to do it rigorously.)
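JavaScript regex literals have no free-spacing mode, so a pattern laid out like this has to be assembled from pieces. The sketch below builds only the http/https branch, using `{m,n}` quantifiers and anchors for whole-string validation. The constant names and `isHttpUrl` are hypothetical, and the same caveats apply: no hyphen-position or octet-range checks.

```javascript
// Building blocks as strings, so they can be commented and combined.
const IPV4 = "[0-9]{1,3}(?:\\.[0-9]{1,3}){3}";
const LABEL = "[-0-9a-z\\u00A0-\\uFFFD]{1,63}";
const HOST = "(?:" + IPV4 + "|" + LABEL + "(?:\\." + LABEL + ")*)";
const PORT = "(?::[0-9]+)?";
const REST = "(?:%[0-9a-f][0-9a-f]|[-._!$&'()*+,:;=@~0-9a-z\\u00A0-\\uFFFD/?#])*";

// http/https branch only; "i" flag supplies the assumed case insensitivity.
const HTTP_URL = new RegExp("^https?://" + HOST + PORT + "/" + REST + "$", "i");

function isHttpUrl(s) {
  return HTTP_URL.test(s);
}

// The three examples from the question.
console.log(isHttpUrl("http://fr.wikipedia.org/wiki/Fran\u00E7ais"));
console.log(isHttpUrl("http://fr.wikipedia.org/wiki/Fran%c3%a7ais"));
console.log(isHttpUrl("http://localhost/localpath/servlet#action?param=value"));
```

All three of the question's examples pass. As written, the pattern requires the "/" after the host, so "http://example.com" without a trailing slash is rejected; make that slash and the remainder optional if you want to accept it.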

answered Oct 16 '22 02:10

bobince

Since you are using Java on the server side, I'd suggest you use java.net.URI. It will accept all the "bizarre" stuff you want, and it is just a matter of calling .getScheme() to check that it is indeed HTTP or HTTPS.

And unlike URL, URI will not try to do name resolution!

answered Oct 16 '22 03:10

fge
