I'd like to construct a regex that will check for a "path" and a "foo" parameter (non-negative integer). "foo" is optional. It should:
MATCH
path?foo=67 # path found, foo = 67
path?foo=67&bar=hello # path found, foo = 67
path?bar=bye&foo=1&baz=12 # path found, foo = 1
path?bar=123 # path found, foo = ''
path # path found, foo = ''
DO NOT MATCH
path?foo=37signals # foo is not integer
path?foo=-8 # foo cannot be negative
something?foo=1 # path not found
Also, I'd like to get the value of foo without performing an additional match.
What would be the simplest regex to achieve this?
As query parameters are not a fixed part of a path, they can be optional and can have default values.
Screw your hard work, I just want the answer! Okay, here you go...
var regex = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/,
    URIs = [
        'path', // valid!
        'pathbreak', // invalid path
        'path?foo=123', // valid!
        'path?foo=-123', // negative
        'invalid?foo=1', // invalid path
        'path?foo=123&bar=abc', // valid!
        'path?bar=abc&foo=123', // valid!
        'path?bar=foo', // valid!
        'path?foo', // valid!
        'path#anchor', // valid!
        'path#foo=bar', // valid!
        'path?foo=123#bar', // valid!
        'path?foo=123abc', // not an integer
    ];
for (var i = 0; i < URIs.length; i++) {
    var URI = URIs[i],
        match = regex.exec(URI);
    if (match) {
        var foo = match[1] ? match[1] : 'null';
        console.log(URI + ' matched, foo = ' + foo);
    } else {
        console.log(URI + ' is invalid...');
    }
}
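One caveat when running this in JavaScript: a capture group that sits inside a repeated group is cleared at the start of every iteration, so match[1] only holds a value when the foo pair is the last thing the loop matched. For path?foo=123&bar=abc the snippet above therefore reports foo = null even though the URI is valid. A possible workaround, offered as a sketch and not as part of the original answer, is to capture the first foo value in a separate lookahead and let the loop do the validation only. The names regexWithLookahead and getFoo are made up for this example.

```javascript
// The pattern from the snippet above: the capture is lost once the loop moves past foo.
const original = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/;
console.log(original.exec('path?foo=123&bar=abc')[1]); // undefined

// Sketch: the first lookahead finds and captures the first foo value without
// consuming anything; the loop that follows only validates and does not capture.
const regexWithLookahead = /^path(?=(?:(?=\?)(?:(?![?&]foo=)[^#])*[?&]foo=(\d*))?)(?:(?=\?)(?:[?&]foo=\d*(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/;

// Hypothetical helper: undefined means "no match", null means "matched, but no foo value".
function getFoo(uri) {
  const match = regexWithLookahead.exec(uri);
  if (!match) {
    return undefined;
  }
  return match[1] ? match[1] : null;
}

for (const uri of ['path?foo=123&bar=abc', 'path?bar=abc&foo=123', 'path?bar=foo', 'path?foo=123abc']) {
  console.log(uri, '->', getFoo(uri));
}
```

With this variant, path?foo=123&bar=abc reports 123. When foo appears more than once, it is the first value that is returned, and every foo still has to be a non-negative integer for the match to succeed.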
Your bounty request asks for "credible and/or official sources", so I'll quote the RFC on query strings.
The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.
This seems pretty vague on purpose: a query string starts with the first ? and is terminated by a # (start of anchor) or the end of the URI (or string/line in our case). They go on to mention that most data sets are in key=value pairs, which seems to be what you expect to be parsing (so let's assume that is the case).
However, as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.
With all this in mind, let's assume a few things about your URIs:
- The path runs from the start of the string until a ? (query string), # (anchor), or the end of the string.
- The query string is a list of key=value pairs joined by & characters. Keeping this mentality:
  - A key cannot be null, will be preceded by a ? or &, and cannot contain a =, & or #.
  - A value is optional, will be preceded by key=, and cannot contain a & or #.
- Anything after a # character is the anchor.

Let's start by mapping out our basic URI structure. You have a path, which is the characters from the start of the string up until a ?, #, or the end of the string. You have an optional query string, which starts at a ? and goes until a # or the end of the string. And you have an optional anchor, which starts at a # and goes until the end of the string.
^
([^?#]+)
(?:
\?
([^#]+)
)?
(?:
#
(.*)
)?
$
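Collapsed onto one line, this structure can be tried out directly. A minimal sketch in JavaScript, with the name splitUri made up for illustration:

```javascript
// The expanded pattern above, written on one line.
const uriParts = /^([^?#]+)(?:\?([^#]+))?(?:#(.*))?$/;

// Hypothetical helper: returns the three pieces, or null if the string does not fit.
function splitUri(uri) {
  const match = uriParts.exec(uri);
  if (!match) {
    return null;
  }
  // Groups that did not take part in the match come back as undefined.
  return { path: match[1], query: match[2], anchor: match[3] };
}

console.log(splitUri('path?foo=123&bar=abc#top'));
console.log(splitUri('path#anchor'));
console.log(splitUri('path'));
```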
Let's do some clean up before digging into the query string. You can easily require the path to equal a certain value by replacing the first capture group. Whatever you replace it with (path) will have to be followed by an optional query string, an optional anchor, and the end of the string (no more, no less). Since you don't need to parse the anchor, the capturing group can be replaced by ending the match at either a # or the end of the string (which is the end of the query string).
^path
(?:
\?
([^#]+)
)?
(?=#|$)
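Written on one line, this step can be checked against a few inputs. This is only a sketch, and the sample URIs are chosen for illustration:

```javascript
// Fixed path, optional query string, and a lookahead in place of the anchor group.
const pathOnly = /^path(?:\?([^#]+))?(?=#|$)/;

for (const uri of ['path', 'path?foo=1&bar=2', 'path?foo=1#top', 'pathbreak', 'other?foo=1']) {
  const match = pathOnly.exec(uri);
  // match[0] stops before any "#", because the lookahead consumes nothing.
  console.log(uri, '->', match ? { matched: match[0], query: match[1] } : 'no match');
}
```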
Okay, I've been doing a lot of setup without really worrying about your specific example. The next example will match a specific path (path) and optionally match a query string while capturing the value of a foo parameter. This means you could stop here and check for a valid match: if the match is valid, then the first capture group must be null or a non-negative integer. But that wasn't your question, was it? This got a lot more complicated, so I'm going to explain the expression inline:
^ (?# match beginning of the string)
path (?# match path literally)
(?: (?# begin optional non-capturing group)
(?=\?) (?# lookahead for a literal ?)
(?: (?# begin repeating non-capturing group)
[?&] (?# keys are preceded by ? or &)
foo (?# match key literally)
(?: (?# begin optional non-capturing group)
= (?# values are preceded by =)
([^&#]*) (?# values are 0+ length and do not contain & or #)
)? (?# end optional non-capturing group)
| (?# OR)
[^#] (?# query strings are non-# characters)
)+ (?# end repeating non-capturing group)
)? (?# end optional non-capturing group)
(?=#|$) (?# lookahead for a literal # or end of the string)
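To see what this intermediate pattern captures, here is a small JavaScript sketch. It assumes the value group is optional, as its comments describe, and fooOf is a name made up for this example. The inputs all end with the foo pair, because in JavaScript a group inside a repeated group only reports what the final iteration captured.

```javascript
// The annotated pattern above without the comments (value group assumed optional).
const captureFoo = /^path(?:(?=\?)(?:[?&]foo(?:=([^&#]*))?|[^#])+)?(?=#|$)/;

// Hypothetical helper: undefined means "no match", null means "matched, but no foo value".
function fooOf(uri) {
  const match = captureFoo.exec(uri);
  if (!match) {
    return undefined;
  }
  return match[1] === undefined ? null : match[1];
}

console.log(fooOf('path?foo=67'));              // value captured as-is
console.log(fooOf('path?bar=1&foo=37signals')); // also captured: nothing checks for digits yet
console.log(fooOf('path?foo=123&foo=bar'));     // the later foo replaces the earlier one
console.log(fooOf('path?foo'));                 // key without a value
console.log(fooOf('something?foo=1'));          // wrong path
```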
Some key takeaways here:
- Without a lookbehind, you can't check for a ? or & before the key foo, meaning you actually have to match one of those characters, meaning the start of your query string (which looks for a ?) has to be a lookahead so that you don't actually match the ?. This also means that your query string will always be at least one character (the ?), so you want to repeat the query string [^#] 1+ times.
- The query string is matched one character at a time, unless the pattern reaches the key foo, in which case it captures the optional value and continues repeating.
- A repeated foo key (path?foo=123&foo=bar) would overwrite the initial captured value, meaning you wouldn't 100% be able to rely on the above solution.

Okay, now that I've captured the foo value, it's time to kill the match on values that are not non-negative integers.
^ (?# match beginning of the string)
path (?# match path literally)
(?: (?# begin optional non-capturing group)
(?=\?) (?# lookahead for a literal ?)
(?: (?# begin repeating non-capturing group)
[?&] (?# keys are preceded by ? or &)
foo (?# match key literally)
= (?# values are preceded by =)
(\d*) (?# value must be a non-negative integer)
(?= (?# begin lookahead)
[&#] (?# literally match & or #)
| (?# OR)
$ (?# match end of the string)
) (?# end lookahead)
| (?# OR)
(?! (?# begin negative lookahead)
[?&] (?# literally match ? or &)
foo= (?# literally match foo=)
) (?# end negative lookahead)
[^#] (?# query strings are non-# characters)
)+ (?# end repeating non-capturing group)
)? (?# end optional non-capturing group)
(?=#|$) (?# lookahead for a literal # or end of the string)
Let's take a closer look at some of the juju that went into that expression:
- After matching foo=\d*, we use a lookahead to ensure that it is followed by a &, #, or the end of the string (the end of a query string value).
- However, if there is more after foo=\d*, the regex would be kicked back by the alternation to a generic [^#] match right at the [?&] before foo. This isn't good, because it will continue to match! So before you look for a generic query string character ([^#]), you must make sure you are not looking at a foo (that must be handled by the first alternation). This is where the negative lookahead (?![?&]foo=) comes in handy.
- This works with multiple foo keys, since they will all have to equal non-negative integers. This lets foo be optional (or equal null) as well.

Disclaimer: Most Regex101 demos use PHP for better syntax highlighting and include \n in negative character classes since there are multiple lines of examples.
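The effect of the negative lookahead is easy to see by running the question's own examples through the final pattern and through a copy with the lookahead removed. This is a sketch in JavaScript; the names withGuard and withoutGuard are made up for the comparison, and only matching is checked here, not the captured value.

```javascript
// Final pattern from the walkthrough, and the same pattern without (?![?&]foo=).
const withGuard = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/;
const withoutGuard = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|[^#])+)?(?=#|$)/;

// The examples listed in the question.
const shouldMatch = ['path?foo=67', 'path?foo=67&bar=hello', 'path?bar=bye&foo=1&baz=12', 'path?bar=123', 'path'];
const shouldFail = ['path?foo=37signals', 'path?foo=-8', 'something?foo=1'];

for (const uri of shouldMatch.concat(shouldFail)) {
  console.log(uri, '| with guard:', withGuard.test(uri), '| without guard:', withoutGuard.test(uri));
}
```

Without the guard, path?foo=37signals and path?foo=-8 slip through, because the generic [^#] branch swallows the ? and everything after it.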