
Google Analytics Regex - Alternative to no negative lookahead

Google Analytics no longer allows negative lookahead in its filters. This makes it very difficult to create a custom report that includes only the links I want.

The regex with negative lookahead, which would work if lookahead were still supported, is:

test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

This matches:

test.com
test.com/
test.com/index_fb2.php
test.com/index_fb2.php?ref=23
test.com/index_fb2.php?ref=23&e=35
test.com/?ref=23 
test.com/?ref=23&e=35

and does not match (as it should):

test.com/ambassadors
test.com/admin/?signup=true 
test.com/randomtext/

How can I adapt my regex so that it keeps the same matches without using negative lookahead?

Thank you!

eiso asked Nov 13 '12 13:11



2 Answers

Google Analytics doesn't seem to support single-line and multiline modes, which makes sense to me. URLs can't contain newlines, so it doesn't matter if the dot doesn't match them and there's never any need for ^ and $ to match anywhere but the beginning and end of the whole string.

That means the (?!.) in your regex is exactly equivalent to $, which matches only at the very end of the string (like \z, in flavors that support it). Since that's the only lookahead in your regex, you should never have had this problem; you should have been using $ all along.
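As a quick sanity check of that equivalence, here is a minimal sketch using Python's re module as a stand-in. That is an assumption: Google Analytics has its own regex engine, which may differ in details. The sketch runs the question's regex once with (?!.) and once with $ over the sample URLs from the question; the names BODY and samples are just illustrative.

```python
import re

# The question's regex, split so that only the final anchor differs.
BODY = r"test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*"
with_lookahead = re.compile(BODY + r"(?!.)")
with_dollar = re.compile(BODY + r"$")

# Sample URLs taken from the question (none of them contains a newline).
samples = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23 ",
    "test.com/?ref=23&e=35",
    "test.com/ambassadors",
    "test.com/admin/?signup=true ",
    "test.com/randomtext/",
]

# Record, per sample, whether each variant finds a match.
lookahead_hits = [bool(with_lookahead.search(s)) for s in samples]
dollar_hits = [bool(with_dollar.search(s)) for s in samples]

for s, a, b in zip(samples, lookahead_hits, dollar_hits):
    print(f"{s!r:40} lookahead={a} dollar={b}")
```

On newline-free input like this, both variants should agree on every line.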

However, your regex has other problems, mostly owing to over-reliance on (.*). For example, it matches these strings:

test.com/?^#(%)!*%supercalifragilisticexpialidocious
test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)

...which I'm pretty sure you don't want. :P
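To see this concretely, the following sketch feeds those two strings to the question's regex, with (?!.) written as $ as described above. It uses Python's re module, which is only assumed to behave like the Google Analytics engine for this pattern.

```python
import re

# The question's regex, anchored with $ instead of the lookahead.
original = re.compile(
    r"test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*$"
)

junk = [
    "test.com/?^#(%)!*%supercalifragilisticexpialidocious",
    "test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)",
]

# Each (.*) happily swallows whatever follows, so both strings are accepted.
accepted = [bool(original.search(s)) for s in junk]
print(accepted)
```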

Try this regex:

test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$

or more readably:

test\.com
(?:
  /
  (?:index_\w+\.php)?
  (?:
    \?ref=\d+
    (?:
      &e=\d+
    )?
  )?
)?
\s*$
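The readable layout above maps directly onto free-spacing mode in flavors that have one. Here is a small sketch with Python's re.VERBOSE flag, checked against the URLs from the question; nothing here says Google Analytics itself accepts free-spacing patterns, so treat the verbose form as documentation and paste the compact one.

```python
import re

compact = re.compile(
    r"test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$"
)

# Same pattern in free-spacing form; re.VERBOSE ignores the layout whitespace.
verbose = re.compile(
    r"""
    test\.com
    (?:
      /
      (?:index_\w+\.php)?
      (?:
        \?ref=\d+
        (?:
          &e=\d+
        )?
      )?
    )?
    \s*$
    """,
    re.VERBOSE,
)

should_match = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23 ",
    "test.com/?ref=23&e=35",
]
should_not_match = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true ",
    "test.com/randomtext/",
    "test.com/?^#(%)!*%supercalifragilisticexpialidocious",
]

# Map each URL to (compact result, verbose result).
results = {
    s: (bool(compact.search(s)), bool(verbose.search(s)))
    for s in should_match + should_not_match
}
for s, pair in results.items():
    print(f"{s!r:55} {pair}")
```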

For illustration purposes I'm making a lot of simplifying assumptions about (e.g.) what parameters can be present, what order they'll appear in, and what their values can be. I'm also wondering if it's really necessary to match the domain (test.com). I have no experience with Google Analytics, but shouldn't the match start (and be anchored) right after the domain? And do you really have to allow for whitespace at the end? It seems to me the regex should be more like this:

^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$
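A minimal sketch of that path-only variant, again in Python as a stand-in. It assumes the field being filtered holds just the path and query string; the entries in paths are hypothetical values made by stripping test.com from the question's URLs.

```python
import re

path_only = re.compile(r"^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$")

# Hypothetical request URIs mapped to the result we expect from the regex.
paths = {
    "/": True,
    "/index_fb2.php": True,
    "/index_fb2.php?ref=23&e=35": True,
    "/?ref=23": True,
    "/ambassadors": False,
    "/admin/?signup=true": False,
    "/randomtext/": False,
}

results = {p: bool(path_only.search(p)) for p in paths}
for p, hit in results.items():
    print(f"{p!r:32} {hit}")
```

Because both ends are anchored, nothing before the leading slash or after the last parameter can sneak in.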
Alan Moore answered Nov 15 '22 08:11


Firstly I think your regex needs some fixing. Let's look at what you have:

test.com(\/\??index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

The case where you use the optional ? at the start of index... is already taken care of by the second alternative:

test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

Now you probably only want the first (.*) to be allowed if there actually was a literal ? before it. Otherwise you will match test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat. So move the corresponding optional marker:

test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)

Now .* consumes any character and as much as possible. Also, the . in front of php consumes any character. This means you would be allowing both test.com/index_fb2php and test.com/index_fb2.html?someparam=php. Let's make that a literal . and only allow non-question-mark characters:

test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
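The effect of this step can be checked with a short sketch in Python (used here only as a stand-in for the Google Analytics engine). The pattern named before is the regex from the previous step and after is the one just above; the variable names are illustrative.

```python
import re

# Before: unescaped "." and greedy ".*" in the file name.
before = re.compile(
    r"test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)"
)
# After: literal "\." and "[^?]*", so the name cannot run into the query string.
after = re.compile(
    r"test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)"
)

unwanted = [
    "test.com/index_fb2php",
    "test.com/index_fb2.html?someparam=php",
]
wanted = [
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
]

# Which of the unwanted URLs does each version let through?
before_unwanted = [bool(before.search(s)) for s in unwanted]
after_unwanted = [bool(after.search(s)) for s in unwanted]
# The tightened version should still accept the wanted URLs.
after_wanted = [bool(after.search(s)) for s in wanted]

print(before_unwanted, after_unwanted, after_wanted)
```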

Now the first and second and third option can be collapsed into one, if we make the file name optional, too:

test.com(\/(index_[^?]*\.php)?(\?(.*))?|)+(\s)*(?!.)

Finally, the + can be removed, because the (.*) inside can already take care of all possible repetitions. Also (something|) is the same as (something)?:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)

Seeing your input examples, this seems to be closer to what you actually want to match.
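To check that collapsing the alternatives did not change the outcome on the question's own examples, here is a small sketch comparing the version with three alternatives and + against the collapsed one. It runs under Python's re module, which is an assumption about how the Google Analytics engine behaves for these patterns.

```python
import re

# Version with separate alternatives and the trailing "+".
expanded = re.compile(
    r"test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)"
)
# Collapsed version: optional file name, optional query string, optional group.
collapsed = re.compile(
    r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)"
)

# The question's examples: the first seven should match, the last three not.
samples = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23 ",
    "test.com/?ref=23&e=35",
    "test.com/ambassadors",
    "test.com/admin/?signup=true ",
    "test.com/randomtext/",
]

expanded_hits = [bool(expanded.search(s)) for s in samples]
collapsed_hits = [bool(collapsed.search(s)) for s in samples]

for s, a, b in zip(samples, expanded_hits, collapsed_hits):
    print(f"{s!r:40} expanded={a} collapsed={b}")
```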

Then to answer your question. What (?!.) does depends on whether you use singleline mode or not. If you do, it asserts that you have reached the end of the string. In this case you can simply replace it with \Z, which matches at the end of the string regardless of mode (in some flavors \Z also matches just before a final newline, and \z is the strict form). If you do not, then it asserts that you have reached the end of a line. In this case you can use $, but you also need multiline mode so that $ matches at line endings, too.

So, if you use singleline mode (which probably means you have only one URL per string), use this:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z

If you do not use singleline mode (which probably means you can have multiple URLs on their own lines), you should also use multiline mode and this kind of anchor instead:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$
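Both setups can be tried out in a flavor that has these modes. The sketch below uses Python, where singleline mode is called re.DOTALL and multiline mode is re.MULTILINE; whether Google Analytics exposes such modes at all is not assumed here. CORE and text are illustrative names and data.

```python
import re

# The final regex without its end anchor.
CORE = r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*"

# One URL per string: singleline mode (re.DOTALL in Python) with \Z.
single = re.compile(CORE + r"\Z", re.DOTALL)
single_hits = [
    bool(single.search(s))
    for s in ("test.com/?ref=23 ", "test.com/ambassadors")
]

# Several URLs, one per line: multiline mode so that $ matches at line ends.
multi = re.compile(CORE + r"$", re.MULTILINE)
text = "test.com/index_fb2.php?ref=23\ntest.com/ambassadors\ntest.com/"
multi_hits = [m.group(0) for m in multi.finditer(text)]

print(single_hits)
print(multi_hits)
```

In the multi-line case only the first and third lines are reported; the middle line is skipped because nothing in the regex can consume "ambassadors".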
Martin Ender answered Nov 15 '22 08:11