
How to extract an optional query parameter using regex in JavaScript

I'd like to construct a regex that will check for a "path" and a "foo" parameter (non-negative integer). "foo" is optional. It should:

MATCH

path?foo=67                 # path found, foo = 67
path?foo=67&bar=hello       # path found, foo = 67
path?bar=bye&foo=1&baz=12   # path found, foo = 1
path?bar=123                # path found, foo = ''
path                        # path found, foo = ''

DO NOT MATCH

path?foo=37signals          # foo is not integer
path?foo=-8                 # foo cannot be negative
something?foo=1             # path not found

Also, I'd like to get the value of foo, without performing an additional match.

What would be the simplest regex to achieve this?

Misha Moroshko asked Sep 15 '14 11:09




1 Answer

The Answer

Screw your hard work, I just want the answer! Okay, here you go...

var regex = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/,
    URIs = [
      'path',                 // valid!
      'pathbreak',            // invalid path
      'path?foo=123',         // valid!
      'path?foo=-123',        // negative
      'invalid?foo=1',        // invalid path
      'path?foo=123&bar=abc', // valid!
      'path?bar=abc&foo=123', // valid!
      'path?bar=foo',         // valid!
      'path?foo',             // valid!
      'path#anchor',          // valid!
      'path#foo=bar',         // valid!
      'path?foo=123#bar',     // valid!
      'path?foo=123abc',      // not an integer
    ];
      
for(var i = 0; i < URIs.length; i++) {
    var URI = URIs[i],
        match = regex.exec(URI);

    if(match) {
        var foo = match[1] ? match[1] : 'null';
        console.log(URI + ' matched, foo = ' + foo);
    } else {
        console.log(URI + ' is invalid...');
    }
}

Research

Your bounty request asks for "credible and/or official sources", so I'll quote RFC 3986 (Section 3.4) on query strings.

The query component contains non-hierarchical data that, along with data in the path component (Section 3.3), serves to identify a resource within the scope of the URI's scheme and naming authority (if any). The query component is indicated by the first question mark ("?") character and terminated by a number sign ("#") character or by the end of the URI.

This seems pretty vague on purpose: a query string starts with the first ? and is terminated by a # (start of anchor) or the end of the URI (or string/line in our case). They go on to mention that query data is often carried in key=value pairs, which seems to be what you expect to parse (so let's assume that is the case).

However, as query components are often used to carry identifying information in the form of "key=value" pairs and one frequently used value is a reference to another URI, it is sometimes better for usability to avoid percent-encoding those characters.

With all this in mind, let's assume a few things about your URIs:

  1. Your examples start with the path, so the path runs from the beginning of the string until a ? (query string), a # (anchor), or the end of the string.
  2. The query string is the iffy part, since the RFC doesn't really define a "norm". A browser tends to expect a query string to be generated from a form submission and to be a list of key=value pairs joined by & characters. Keeping this mentality:
  • A key cannot be null, will be preceded by a ? or &, and cannot contain a =, & or #.
  • A value is optional, will be preceded by key=, and cannot contain a & or #.
  3. Anything after a # character is the anchor.
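
Before reaching for a regex, those assumptions can be written out as plain string handling. This is only a rough sketch to make the rules concrete; the parseQuery helper is a hypothetical name, not part of any library.

```javascript
// Hypothetical helper: list the key=value pairs of a URI under the
// assumptions above (anchor stripped, pairs joined by &, value optional).
function parseQuery(uri) {
    var hash = uri.indexOf('#');
    // Anything after a # is the anchor, so drop it first.
    var withoutAnchor = hash === -1 ? uri : uri.slice(0, hash);
    var mark = withoutAnchor.indexOf('?');
    if (mark === -1) {
        return []; // no query string at all
    }
    return withoutAnchor.slice(mark + 1).split('&')
        // A key cannot be null, so skip empty pairs and pairs starting with =.
        .filter(function (pair) {
            return pair !== '' && pair.charAt(0) !== '=';
        })
        .map(function (pair) {
            var eq = pair.indexOf('=');
            return eq === -1
                ? { key: pair, value: null } // the value is optional
                : { key: pair.slice(0, eq), value: pair.slice(eq + 1) };
        });
}

console.log(parseQuery('path?bar=bye&foo=1&baz=12#top'));
```

The rest of this answer does the same job in a single expression, which is what the question asks for.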

Let's Begin!

Let's start by mapping out our basic URI structure. You have a path, which is the characters from the start of the string up until a ?, a #, or the end of the string. You have an optional query string, which starts at a ? and goes until a # or the end of the string. And you have an optional anchor, which starts at a # and goes until the end of the string.

^
([^?#]+)
(?:
  \?
  ([^#]+)
)?
(?:
  #
  (.*)
)?
$
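
As a quick sanity check, the skeleton above can be collapsed onto one line and run in JavaScript. A minimal sketch; the splitURI helper name is made up for illustration.

```javascript
// The skeleton above on one line: path, optional query string, optional anchor.
var structure = /^([^?#]+)(?:\?([^#]+))?(?:#(.*))?$/;

// Hypothetical helper: returns the three parts, or null if nothing matches.
function splitURI(uri) {
    var match = structure.exec(uri);
    if (!match) {
        return null;
    }
    return {
        path: match[1],
        query: match[2],  // undefined when there is no query string
        anchor: match[3]  // undefined when there is no anchor
    };
}

console.log(splitURI('path?foo=67&bar=hello#top'));
console.log(splitURI('path'));
```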

Let's do some cleanup before digging into the query string. You can easily require the path to equal a certain value by replacing the first capture group. Whatever you replace it with (path) will have to be followed by an optional query string, an optional anchor, and the end of the string (no more, no less). Since you don't need to parse the anchor, its capturing group can be replaced by ending the match at either a # or the end of the string (which is the end of the query string).

^path
(?:
  \?
  ([^#]+)
)?
(?=#|$)
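
Run as a one-liner, the fixed-path version looks roughly like this (a sketch; the sample URIs are taken from the question, plus one with an anchor):

```javascript
// Fixed path, optional query string, then a lookahead for # or the end.
var pathOnly = /^path(?:\?([^#]+))?(?=#|$)/;

['path', 'path?bar=123', 'path?foo=67#top', 'pathbreak', 'something?foo=1']
    .forEach(function (uri) {
        var match = pathOnly.exec(uri);
        // match[1] is the whole query string, or undefined if there is none.
        console.log(uri + ' -> ' + (match ? 'query = ' + match[1] : 'no match'));
    });
```

Because the lookahead does not consume the #, the anchor never becomes part of the match.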

Stop Messing Around

Okay, I've been doing a lot of setup without really worrying about your specific example. The next example will match a specific path (path) and optionally match a query string while capturing the value of a foo parameter. This means you could stop here and validate the result yourself: if the expression matches, the URI is valid only when the first capture group is null or a non-negative integer. But that wasn't your question, was it? This got a lot more complicated, so I'm going to explain the expression inline:

^            (?# match beginning of the string)
path         (?# match path literally)
(?:          (?# begin optional non-capturing group)
 (?=\?)      (?# lookahead for a literal ?)
 (?:         (?# begin repeating non-capturing group)
   [?&]      (?# keys are preceded by ? or &)
   foo       (?# match key literally)
   (?:       (?# begin optional non-capturing group)
    =        (?# values are preceded by =)
    ([^&#]*) (?# values are 0+ length and do not contain & or #)
   )?        (?# end optional non-capturing group)
  |          (?# OR)
   [^#]      (?# query strings are non-# characters)
 )+          (?# end repeating non-capturing group)
)?           (?# end optional non-capturing group)
(?=#|$)      (?# lookahead for a literal # or end of the string)

Some key takeaways here:

  • JavaScript doesn't support lookbehinds (they only arrived later, in ES2018), so you can't look behind for a ? or & before the key foo; you actually have to match one of those characters. That in turn means the start of your query string (which looks for a ?) has to be a lookahead, so that you don't consume the ?. It also means that your query string will always be at least one character (the ?), so you want to repeat the query string [^#] 1+ times.
  • The query string now repeats one character at a time in a non-capturing group, unless it sees the key foo, in which case it captures the optional value and continues repeating.
  • Since this non-capturing query string group repeats all the way until the anchor or end of the URI, a second foo value (path?foo=123&foo=bar) would overwrite the initial captured value, meaning you wouldn't 100% be able to rely on the above solution.
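
To see what "stopping here" would look like, this sketch collapses the commented expression onto one line (with the =value part optional, as the comments describe) and validates the capture in a second step. The looseFoo helper is a hypothetical name. One JavaScript detail is worth keeping in mind: per the ECMAScript specification, captures inside a repeated group are cleared at the start of each repetition, so the capture can also come back undefined when other characters follow foo.

```javascript
// The commented expression above, collapsed onto one line for JavaScript.
var loose = /^path(?:(?=\?)(?:[?&]foo(?:=([^&#]*))?|[^#])+)?(?=#|$)/;

// Hypothetical helper: match first, then validate the captured value separately.
function looseFoo(uri) {
    var match = loose.exec(uri);
    if (!match) {
        return { matched: false };
    }
    var value = match[1];
    return {
        matched: true,
        foo: value,
        // undefined and '' both count as "foo not supplied"
        valid: value === undefined || /^\d*$/.test(value)
    };
}

console.log(looseFoo('path?bar=bye&foo=1'));     // foo is the last pair
console.log(looseFoo('path?foo=37signals'));     // captured, but not an integer
console.log(looseFoo('path?foo=123&foo=bar'));   // the later foo wins
console.log(looseFoo('path?foo=67&bar=hello'));  // capture cleared by later repetitions
```

In a spec-compliant engine the last call reports foo as undefined even though foo=67 is present, which is one more reason not to rely on this intermediate expression alone.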

Final Solution?

Okay, now that I've captured the foo value, it's time to kill the match on values that are not non-negative integers.

^            (?# match beginning of the string)
path         (?# match path literally)
(?:          (?# begin optional non-capturing group)
 (?=\?)      (?# lookahead for a literal ?)
 (?:         (?# begin repeating non-capturing group)
   [?&]      (?# keys are preceded by ? or &)
   foo       (?# match key literally)
   =         (?# values are preceded by =)
   (\d*)     (?# value must be empty or a non-negative integer)
   (?=       (?# begin lookahead)
     [&#]    (?# literally match & or #)
    |        (?# OR)
     $       (?# match end of the string)
   )         (?# end lookahead)
  |          (?# OR)
   (?!       (?# begin negative lookahead)
    [?&]     (?# literally match ? or &)
    foo=     (?# literally match foo=)
   )         (?# end negative lookahead)
   [^#]      (?# query strings are non-# characters)
 )+          (?# end repeating non-capturing group)
)?           (?# end optional non-capturing group)
(?=#|$)      (?# lookahead for a literal # or end of the string)
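
JavaScript regex literals have no free-spacing mode and no (?# ...) comments, so the annotated form above cannot be pasted in as-is. One way to keep the annotations, sketched here, is to join commented string fragments and compile them with the RegExp constructor (note the doubled backslashes inside the strings):

```javascript
// Build the final expression from commented fragments.
var finalRegex = new RegExp([
    '^path',             // path, literally, from the start of the string
    '(?:',               // optional query string
      '(?=\\?)',         //   it must start with a literal ?
      '(?:',             //   repeat one alternative at a time
        '[?&]foo=',      //     the foo key, preceded by ? or &
        '(\\d*)',        //     its value: digits only (may be empty)
        '(?=[&#]|$)',    //     the value must end at &, # or the end
      '|',
        '(?![?&]foo=)',  //     otherwise, refuse to step over a foo key
        '[^#]',          //     and consume one non-# character
      ')+',
    ')?',
    '(?=#|$)'            // stop at the anchor or the end of the string
].join(''));

console.log(finalRegex.source);
console.log(finalRegex.test('path?foo=123'), finalRegex.test('path?foo=-8'));
```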

Let's take a closer look at some of the juju that went into that expression:

  • After finding foo=\d*, we use a lookahead to ensure that it is followed by a &, #, or the end of the string (the end of a query string value).
  • However, if there is more after foo=\d* (say, foo=123abc), the regex would be kicked back by the alternation to a generic [^#] match right at the [?&] before foo. This isn't good, because it will continue to match! So before you look for a generic query string character ([^#]), you must make sure you are not looking at a foo (that must be handled by the first alternative). This is where the negative lookahead (?![?&]foo=) comes in handy.
  • This will work with multiple foo keys, since they will all have to equal non-negative integers. This lets foo be optional (or equal null) as well.
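
A JavaScript-specific caveat is worth checking in your own engine. The ECMAScript specification clears the captures of a repeated group at the start of every repetition, so with the final expression the value of foo only survives when the foo pair is the last thing the repeating group matches. The sketch below compares the final expression with a restructured variant; firstFoo is my own assumption, not the author's expression. It matches the capturing foo once, outside any repeated group, and still validates (but does not capture) any later foo keys.

```javascript
// The final expression from this answer, unchanged.
var strict = /^path(?:(?=\?)(?:[?&]foo=(\d*)(?=[&#]|$)|(?![?&]foo=)[^#])+)?(?=#|$)/;

// Hypothetical variant: capture the first foo outside the repeated groups.
var firstFoo = new RegExp(
    '^path(?:(?=\\?)' +
    '(?:(?![?&]foo=)[^#])*' +                         // text before the first foo
    '(?:[?&]foo=(\\d*)(?=[&#]|$)' +                   // first foo, captured once
    '(?:[?&]foo=\\d*(?=[&#]|$)|(?![?&]foo=)[^#])*' +  // rest of the query string
    ')?)?(?=#|$)'
);

['path?foo=67&bar=hello', 'path?bar=bye&foo=1&baz=12', 'path?foo=1&foo=2',
 'path?foo=1&foo=x', 'path?foo=-8'].forEach(function (uri) {
    var a = strict.exec(uri), b = firstFoo.exec(uri);
    console.log(uri + ' | strict: ' + (a ? a[1] : 'no match') +
                ' | firstFoo: ' + (b ? b[1] : 'no match'));
});
```

Both expressions accept and reject the same five URIs. In a spec-compliant engine, strict reports undefined for the first two (the capture is cleared by the repetitions that follow foo), while firstFoo reports 67 and 1. With two foo keys, strict keeps the last value and firstFoo keeps the first.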

Disclaimer: Most Regex101 demos use PHP for better syntax highlighting and include \n in negative character classes since there are multiple lines of examples.

Sam answered Oct 22 '22 13:10