This regex:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3} (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2} |com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([a-zA-Z0-9\?\=\&\%\/]*)?$
Formatted for readability:
^( # Begin regex / begin address clause
(https?|ftp)\:(\/\/)|(file\:\/{2,3}))? # protocol
( # container for two address formats, more to come later
((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) # match IP addresses
)|( # delimiter for address formats
((([a-zA-Z0-9]+)(\.)?)+?) # match domains and any number of subdomains
(\.) #dot for .com
([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum) #TLD clause
) # end address clause
([a-zA-Z0-9\?\=\&\%\/]*)? # querystring support, will pretty this up later
$
is matching:
www.google
and shouldn't be. This is one of my "fail" test cases. I have declared the TLD portion of the URL to be mandatory when matching on alpha instead of on IP, and "google" doesn't fit into the "[a-z]{2}" clause.
Keep in mind I will fix the following issues seperately - this question is about why it matches www.google and shouldn't.
What's wrong with my regex?
edit: See also a previous problem with an earlier version of this regex on a different test case: How can I make this regex match correctly?
edit2: Fixed - The corrected regex (as asked) is:
^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3} (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([\/][\/a-zA-Z0-9\.]*)*?([\/]?[\?][a-zA-Z0-9\=\&\%\/]*)?$
$ means "Match the end of the string" (the position after the last character in the string). Both are called anchors and ensure that the entire string is matched instead of just a substring.
In order to match a line that does not contain something, use negative lookahead (described in Recipe 2.16). Notice that in this regular expression, a negative lookahead and a dot are repeated together using a noncapturing group.
You make it non-greedy by using ". *?" When using the latter construct, the regex engine will, at every step it matches text into the "." attempt to match whatever make come after the ". *?" . This means that if for instance nothing comes after the ".
If you need to know if a string matches a regular expression RegExp , use RegExp.prototype.test() . If you only want the first match found, you might want to use RegExp.prototype.exec() instead.
"google" might not fit in [a-z]{2}
, but it does fit in [a-z]{2}([a-zA-Z0-9\?\=\&\%\/]*)?
- you forgot to require a /
after the TLD if the URL extends beyond the domain. So it's interpreting it with "www.go" as the domain and then "ogle" following it, with no slash in between. You can fix it by adding a [?/]
to the front of that last group to require one of those two symbols between the TLD and any further portion of the URL.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With