Google Analytics Regex - Alternative to no negative lookahead

Tags: regex, google-analytics

Google Analytics no longer allows negative lookahead within its filters. This makes it very difficult to create a custom report that includes only the links I want.

The regex that would work if negative lookahead were still enabled is:

test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

This matches:

test.com
test.com/
test.com/index_fb2.php
test.com/index_fb2.php?ref=23
test.com/index_fb2.php?ref=23&e=35
test.com/?ref=23 
test.com/?ref=23&e=35

and does not match (as it should):

test.com/ambassadors
test.com/admin/?signup=true 
test.com/randomtext/

How can I adapt my regex so that it keeps the same matches without using negative lookahead?

Thank you!


asked Nov 13 '12 13:11

eiso

2 Answers

Google Analytics doesn't seem to support single-line and multiline modes, which makes sense to me. URLs can't contain newlines, so it doesn't matter if the dot doesn't match them and there's never any need for ^ and $ to match anywhere but the beginning and end of the whole string.

That means the (?!.) in your regex is exactly equivalent to $, which matches only at the very end of the string (like \z, in flavors that support it). Since that's the only lookahead in your regex, you should never have had this problem; you could have been using $ all along.
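To illustrate the equivalence, here is a minimal Python sketch. It uses Python's `re` module as a stand-in, which is an assumption: Google Analytics runs its own regex engine, so this only shows the idea for strings without newlines. The two deliberately simple patterns and the sample strings are made up for the demonstration.

```python
import re

# Two versions of a deliberately simple pattern: one ends with the
# negative lookahead from the question, the other with a plain `$`.
WITH_LOOKAHEAD = re.compile(r"test\.com/?(?!.)")
WITH_DOLLAR = re.compile(r"test\.com/?$")

# Hypothetical sample strings (no newlines, as in real URLs).
SAMPLES = ["test.com", "test.com/", "test.com/ambassadors", "test.com/admin/"]


def same_verdict(text: str) -> bool:
    """Return True if both patterns agree on whether `text` matches."""
    return bool(WITH_LOOKAHEAD.search(text)) == bool(WITH_DOLLAR.search(text))


for sample in SAMPLES:
    print(sample, bool(WITH_DOLLAR.search(sample)), same_verdict(sample))
```

For every sample the two patterns give the same verdict, which is the point being made: with no newlines in play, the lookahead adds nothing over the anchor.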

However, your regex has other problems, mostly owing to over-reliance on (.*). For example, it matches these strings:

test.com/?^#(%)!*%supercalifragilisticexpialidocious
test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)

...which I'm pretty sure you don't want. :P
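A quick way to see this for yourself, sketched in Python (an assumption on my part, since Python's `re` still supports lookahead while Google Analytics does not). The pattern is the one from the question, copied unchanged.

```python
import re

# The regex from the question, unchanged.
ORIGINAL = re.compile(
    r"test.com(\/\??index\_(.*)\.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)"
)

# The two strings quoted above that presumably should NOT be accepted.
UNWANTED = [
    "test.com/?^#(%)!*%supercalifragilisticexpialidocious",
    "test.com/index_ecky-ecky-ecky-ecky-PTANG!-vroop-boing_rowr.php (ni! shh!)",
]

# Each entry is True when the original regex accepts the unwanted string.
accepted = [ORIGINAL.search(url) is not None for url in UNWANTED]
print(accepted)
```

Both strings are accepted because each `(.*)` happily swallows whatever follows it.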

Try this regex:

test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$

or more readably:

test\.com
(?:
  /
  (?:index_\w+\.php)?
  (?:
    \?ref=\d+
    (?:
      &e=\d+
    )?
  )?
)?
\s*$
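If you want to try both forms outside of Google Analytics, here is a rough Python sketch. It assumes Python's `re` module, where the `re.VERBOSE` flag lets the spaced-out layout be compiled as written. The two URL lists are taken from the question; the names `COMPACT`, `SPACED`, and `accepts` are only for illustration.

```python
import re

# One-line form of the suggested regex.
COMPACT = re.compile(
    r"test\.com(?:/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?)?\s*$"
)

# Same regex in the spaced-out layout; re.VERBOSE ignores the whitespace.
SPACED = re.compile(
    r"""
    test\.com
    (?:
      /
      (?:index_\w+\.php)?
      (?:
        \?ref=\d+
        (?:
          &e=\d+
        )?
      )?
    )?
    \s*$
    """,
    re.VERBOSE,
)

SHOULD_MATCH = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23",
    "test.com/?ref=23&e=35",
]
SHOULD_NOT_MATCH = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
    "test.com/randomtext/",
]


def accepts(pattern: re.Pattern, url: str) -> bool:
    """Return True if `pattern` is found anywhere in `url`."""
    return pattern.search(url) is not None


for url in SHOULD_MATCH + SHOULD_NOT_MATCH:
    print(url, accepts(COMPACT, url), accepts(SPACED, url))
```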

For illustration purposes I'm making a lot of simplifying assumptions about (e.g.) which parameters can be present, what order they'll appear in, and what their values can be. I'm also wondering if it's really necessary to match the domain (test.com). I have no experience with Google Analytics, but shouldn't the match start (and be anchored) right after the domain? And do you really have to allow for whitespace at the end? It seems to me the regex should be more like this:

^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$
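A small Python sketch of how the path-only version behaves, assuming the string being tested is just the request path (that is my guess about what the filter sees, not something confirmed here). The helper name `is_landing_path` is hypothetical.

```python
import re

# Path-only variant: anchored at both ends, no domain, no trailing whitespace.
PATH_ONLY = re.compile(r"^/(?:index_\w+\.php)?(?:\?ref=\d+(?:&e=\d+)?)?$")


def is_landing_path(path: str) -> bool:
    """Return True if `path` is one of the allowed landing-page paths."""
    return PATH_ONLY.search(path) is not None


for path in ["/", "/index_fb2.php?ref=23&e=35", "/?ref=23", "/ambassadors"]:
    print(path, is_landing_path(path))
```

Note that this variant requires the leading slash, so an empty path would not match.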


answered Nov 15 '22 08:11

Alan Moore

Firstly I think your regex needs some fixing. Let's look at what you have:

test.com(\/\??index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

The case where you use the optional ? at the start of index... is already taken care of by the second alternative:

test.com(\/index_.*.php\??(.*)|\/\?(.*)|\/|)+(\s)*(?!.)

Now you probably only want the first (.*) to be allowed, if there actually was a literal ? before. Otherwise you will match test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat. So move the corresponding optional marker:

test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)

Now .* consumes any character and as much as possible. Also, the . in front of php consumes any character. This means you would be allowing both test.com/index_fb2php and test.com/index_fb2.html?someparam=php. Let's make that a literal . and only allow non-question-mark characters:

test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)
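To make the difference concrete, here is a rough Python sketch comparing the pattern before and after this step. Python's `re` is used only because it still accepts the lookahead; the names `LOOSE` and `TIGHT` are made up for the comparison.

```python
import re

# Before: unescaped dot in front of "php" and an unrestricted `.*`.
LOOSE = re.compile(r"test.com(\/index_.*.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)")

# After: literal dot and only non-question-mark characters in the file name.
TIGHT = re.compile(r"test.com(\/index_[^?]*\.php(\?(.*))?|\/\?(.*)|\/|)+(\s)*(?!.)")

# The two problem URLs mentioned above.
PROBLEM_URLS = [
    "test.com/index_fb2php",
    "test.com/index_fb2.html?someparam=php",
]

for url in PROBLEM_URLS:
    print(url, LOOSE.search(url) is not None, TIGHT.search(url) is not None)
```

The loose pattern accepts both problem URLs, the tightened one rejects them while still accepting something like test.com/index_fb2.php?ref=23.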

Now the first, second, and third options can be collapsed into one if we make the file name optional, too:

test.com(\/(index_[^?]*\.php)?(\?(.*))?|)+(\s)*(?!.)

Finally, the + can be removed, because the (.*) inside can already take care of all possible repetitions. Also (something|) is the same as (something)?:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)
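As a sanity check, this Python sketch runs the simplified pattern against the lists from the question. It is only an approximation of what Google Analytics would do, since it relies on Python's `re` (which still supports the lookahead); the name `SIMPLIFIED` is just a label.

```python
import re

# The simplified regex from the last step, lookahead still in place.
SIMPLIFIED = re.compile(r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*(?!.)")

SHOULD_MATCH = [
    "test.com",
    "test.com/",
    "test.com/index_fb2.php",
    "test.com/index_fb2.php?ref=23",
    "test.com/index_fb2.php?ref=23&e=35",
    "test.com/?ref=23",
    "test.com/?ref=23&e=35",
]
SHOULD_NOT_MATCH = [
    "test.com/ambassadors",
    "test.com/admin/?signup=true",
    "test.com/randomtext/",
    "test.com/index_fb2.phpanystringhereandyouprobablydon'twantthat",
]

for url in SHOULD_MATCH + SHOULD_NOT_MATCH:
    print(url, SIMPLIFIED.search(url) is not None)
```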

Seeing your input examples, this seems to be closer to what you actually want to match.

Now to answer your question. What (?!.) does depends on whether you use singleline mode or not. If you do, the dot matches every character including newlines, so (?!.) asserts that you have reached the end of the string. In this case you can simply replace it with \Z, which matches at the end of the string (in some flavors it also matches just before a single trailing newline; \z is the strict variant where it is available). If you do not, then (?!.) asserts that you have reached the end of a line. In this case you can use $, but you also need to use multiline mode so that $ matches at line endings, too.

So, if you use singleline mode (which probably means you have only one URL per string), use this:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*\Z

If you do not use singleline mode (which probably means you can have multiple URLs on their own lines), you should also use multiline mode and this kind of anchor instead:

test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*$
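Here is a minimal Python sketch of both anchors. It assumes Python's `re`, where `\Z` matches only at the very end of the string and `re.MULTILINE` makes `$` match at line endings; other flavors may differ slightly. The sample text `LOG` and the helper `matching_lines` are invented for the example.

```python
import re

# Shared body of the regex; only the final anchor differs.
BODY = r"test.com(\/(index_[^?]*\.php)?(\?(.*))?)?(\s)*"

END_OF_STRING = re.compile(BODY + r"\Z")           # one URL per string
END_OF_LINE = re.compile(BODY + "$", re.MULTILINE)  # one URL per line

# Hypothetical input with several URLs on their own lines.
LOG = "test.com/index_fb2.php?ref=23\ntest.com/ambassadors\ntest.com/?ref=23"


def matching_lines(text: str) -> list[str]:
    """Return every URL in `text` that the multiline pattern accepts."""
    return [m.group(0).strip() for m in END_OF_LINE.finditer(text)]


print(matching_lines(LOG))
print(END_OF_STRING.search("test.com/?ref=23") is not None)
print(END_OF_STRING.search("test.com/ambassadors") is not None)
```

The `.strip()` is there because `(\s)*` is allowed to consume trailing whitespace, which you probably do not want in the extracted URL.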

answered Nov 15 '22 08:11

Martin Ender
