This regex matches and shouldn't. Why does it?

Tags:

regex

This regex:

^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([a-zA-Z0-9\?\=\&\%\/]*)?$

Formatted for readability:

^( # Begin regex / begin address clause
  (https?|ftp)\:(\/\/)|(file\:\/{2,3}))? # protocol
  ( # container for two address formats, more to come later
   ((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}
   (25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?) # match IP addresses
  )|( # delimiter for address formats
   ((([a-zA-Z0-9]+)(\.)?)+?) # match domains and any number of subdomains
   (\.) #dot for .com
   ([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum) #TLD clause
  ) # end address clause
([a-zA-Z0-9\?\=\&\%\/]*)? # querystring support, will pretty this up later
$

is matching:

www.google

and shouldn't be. This is one of my "fail" test cases. I have declared the TLD portion of the URL to be mandatory when matching on alpha instead of on IP, and "google" doesn't fit into the "[a-z]{2}" clause.

Keep in mind I will fix the following issues separately - this question is only about why it matches www.google when it shouldn't.

Querystring needs to support proper formats only, currently accepts any combination of querystring characters
Several protocols not supported, though the scope of my requirements may not include them
Uncommon TLDs with 3 characters not included
Probably matches http://www.google..com - will check for consecutive dots
Doesn't support decimal IP address formats

What's wrong with my regex?

edit: See also a previous problem with an earlier version of this regex on a different test case: How can I make this regex match correctly?

edit2: Fixed - The corrected regex (as asked) is:

^((https?|ftp)\:(\/\/)|(file\:\/{2,3}))?(((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?))|(((([a-zA-Z0-9]+)(\.)?)+?)(\.)([a-z]{2}|com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum))([\/][\/a-zA-Z0-9\.]*)*?([\/]?[\?][a-zA-Z0-9\=\&\%\/]*)?$


asked Oct 27 '09 09:10

tsilb

1 Answer

"google" might not fit in [a-z]{2}, but it does fit in [a-z]{2}([a-zA-Z0-9\?\=\&\%\/]*)? - you forgot to require a / after the TLD if the URL extends beyond the domain. So it's interpreting it with "www.go" as the domain and then "ogle" following it, with no slash in between. You can fix it by adding a [?/] to the front of that last group to require one of those two symbols between the TLD and any further portion of the URL.
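To see the effect in isolation, here is a minimal Python sketch. It uses a cut-down stand-in for the domain part of the regex, not the full pattern from the question: the names loose, strict and TLD are made up for this illustration, and the TLD list is shortened. The only difference between the two patterns is the [?/] required at the front of the trailing group.

```python
import re

# Simplified stand-in for the domain portion of the question's regex.
# Assumption: shortened TLD list and non-capturing groups, for readability only.
TLD = r"(?:[a-z]{2}|com|org|net)"
DOMAIN = r"^(?:[a-zA-Z0-9]+\.?)+?\." + TLD

# Original behaviour: anything from the character class may follow the TLD.
loose = re.compile(DOMAIN + r"([a-zA-Z0-9?=&%/]*)?$")

# Suggested fix: the trailing part must start with '?' or '/'.
strict = re.compile(DOMAIN + r"([?/][a-zA-Z0-9?=&%/]*)?$")

m = loose.match("www.google")
# "go" is taken as the two-letter TLD and "ogle" lands in the trailing group.
print(m.group(1))

print(strict.match("www.google"))                          # no match
print(strict.match("www.google.com") is not None)          # matches
print(strict.match("www.google.com/search?q=1").group(1))  # trailing part
```

With the loose pattern the trailing group captures "ogle", which is exactly the split described above. The strict pattern rejects www.google but still accepts www.google.com with or without a path or querystring.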


answered Oct 21 '22 11:10

Amber
