
How to write a conditional regex

Tags: c#, regex

I want a regex that does one thing if the string contains 3 instances of "." and something else if it contains more than 3.

for example

aaa.bbb.ccc.ddd // one part of the regex

aaa.bbb.ccc.ddd.eee // the second part of the regex

how do I achieve this in either js or c#?

something like

?(\.){4} then THIS else THAT

within the regex...

Update

Ok basically what I'm doing is this:

I want to switch, for any given System.Uri, to another subdomain in an extension method.

The problem I came across is that my domains are usually of the form http://subdomain.domain.TLD.TLD/more/url, but sometimes, it can be just http://domain.TLD.TLD/more/url (which just points to www)

So this is what I came up with:

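using System;
using System.Text.RegularExpressions;

public static class UriExtensions
{
    // Pseudo TLD: one or two trailing labels of 2-3 characters (.XX[X] or .XX[X].XX[X]).
    private const string TopLevelDomainRegex = @"(\.[^\.]{2,3}|\.[^\.]{2,3}\.[^\.]{2,3})$";
    // Host without a subdomain: group 4 is empty, group 5 is the whole host.
    private const string UnspecifiedSubdomainRegex = @"^((http[s]?|ftp):\/\/)(()([^:\/\s]+))(:([^\/]*))?((?:\/)?|(?:\/)(((\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?))?$";
    // Host with a subdomain: group 4 is the subdomain, group 5 is the rest of the host.
    private const string SpecifiedSubdomainRegex = @"^((http[s]?|ftp):\/\/)(([^.:\/\s]*)[\.]([^:\/\s]+))(:([^\/]*))?((?:\/)?|(?:\/)(((\w+)*\/)([\w\-\.]+[^#?\s]+)(\?([^#]*))?(#(.*))?))?$";

    public static string AbsolutePathToSubdomain(this Uri uri, string subdomain)
    {
        // "www" means no explicit subdomain.
        subdomain = subdomain == "www" ? string.Empty : string.Concat(subdomain, ".");

        // $1 = scheme plus "://", $5 = host without the old subdomain, $6 = optional ":port".
        var replacement = string.Format("$1{0}$5$6", subdomain);

        // After stripping the pseudo TLD, any remaining dot means a subdomain is present.
        var spec = Regex.Replace(uri.Authority, TopLevelDomainRegex, string.Empty).Contains(".");
        return Regex.Replace(uri.AbsoluteUri, spec ? SpecifiedSubdomainRegex : UnspecifiedSubdomainRegex, replacement);
    }
}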

Basically with this code I take the System.Uri and:

  1. Take just the subdomain.domain.TLD.TLD using the Authority property.
  2. Match it against "pseudo TLDs" (I'm never going to have a registered domain with 2-3 letters that would break the regex, which basically checks for anything ending in .XX[X] or .XX[X].XX[X])
  3. I strip the TLDs, and end up with either domain or subdomain.domain
  4. If the resulting data has zero dots, I use the UnspecifiedSubdomainRegex, because I couldn't figure out how to use SpecifiedSubdomainRegex to tell it that if it has no dots on that part, it should return string.Empty
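
For illustration only, the four steps above could be sketched in Python (not the C# used in this question) by parsing the URL first and applying a regex only to the host. The helper name switch_subdomain and the pattern name PSEUDO_TLD are hypothetical, and the sketch assumes the same "pseudo TLD" rule as TopLevelDomainRegex; unlike the replacement string above, it keeps the path, query and fragment and drops any user:password part.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: same "pseudo TLD" rule as TopLevelDomainRegex, i.e. one or two
# trailing labels of 2-3 characters each (.XX[X] or .XX[X].XX[X]).
PSEUDO_TLD = re.compile(r"(?:\.[^.]{2,3}){1,2}$")


def switch_subdomain(url, subdomain):
    """Return url with its subdomain replaced (hypothetical helper)."""
    parts = urlsplit(url)
    host = parts.hostname or ""

    # Steps 2 and 3: find the pseudo TLD and strip it from the host.
    tld_match = PSEUDO_TLD.search(host)
    tld = tld_match.group(0) if tld_match else ""
    rest = host[: len(host) - len(tld)]

    # Step 4: "rest" is either "domain" or "subdomain.domain";
    # the last label is the domain in both cases, so no second regex is needed.
    domain = rest.rsplit(".", 1)[-1]

    # "www" means no explicit subdomain, as in the C# method.
    prefix = "" if subdomain == "www" else subdomain + "."
    netloc = prefix + domain + tld
    if parts.port is not None:
        netloc += ":" + str(parts.port)

    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))
```

Because the domain is always the last label left after stripping the pseudo TLD, the "zero dots or not" decision disappears in this sketch.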

My question, then, is whether there is a way to merge these three regexes into something simpler.

PS: Forget about JavaScript, I was just using it to test the regex on the fly.

bevacqua asked Jul 24 '11 02:07


3 Answers

You can do this using the (?(?=condition)then|else) construct. However, this is not available in JavaScript (but it is available in .NET, Perl and PCRE):

^(?(?=(?:[^.]*\.){3}[^.]*$)aaa|eee)

for example, will check if a string contains exactly three dots, and if it does, it tries to match aaa at the start of the string; otherwise it tries to match eee. So it will match the first three letters of

aaa.bbb.ccc.ddd
eee.ddd.ccc.bbb.aaa
eee

but fail on

aaa.bbb.ccc
eee.ddd.ccc.bbb
aaa.bbb.ccc.ddd.eee

Explanation:

^            # Start of string
(?           # Conditional: If the following lookahead succeeds:
 (?=         #   Positive lookahead - can we match...
  (?:        #     the following group, consisting of
   [^.]*\.   #     0+ non-dots and 1 dot
  ){3}       #     3 times
  [^.]*      #     followed only by non-dots...
  $          #     until end-of-string?
 )           #   End of lookahead
 aaa         # Then try to match aaa
|            # else...
 eee         # try to match eee
)            # End of conditional
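
In engines that lack lookahead conditionals, such as JavaScript or Python's re module, the same if/else can be emulated by putting a positive lookahead on one branch and the matching negative lookahead on the other. A minimal sketch in Python, where the names EXACTLY_THREE_DOTS and first_three are hypothetical:

```python
import re

# Condition from the answer above: exactly three dots in the whole string.
EXACTLY_THREE_DOTS = r"(?:[^.]*\.){3}[^.]*$"

# "if condition then aaa else eee", written as two guarded alternatives:
#   (?=condition)aaa  is tried when the condition holds,
#   (?!condition)eee  is tried when it does not.
pattern = re.compile(
    rf"^(?:(?={EXACTLY_THREE_DOTS})aaa|(?!{EXACTLY_THREE_DOTS})eee)"
)


def first_three(s):
    """Return the matched prefix, or None if the string is rejected."""
    m = pattern.match(s)
    return m.group(0) if m else None


for s in ("aaa.bbb.ccc.ddd", "eee.ddd.ccc.bbb.aaa", "eee",
          "aaa.bbb.ccc", "eee.ddd.ccc.bbb", "aaa.bbb.ccc.ddd.eee"):
    print(s, "->", first_three(s))
```

The cost of this emulation is that the condition is written twice and may be evaluated twice, which is why the native conditional is preferable where it is available.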

Tim Pietzcker answered Sep 23 '22 23:09


^(?:[^.]*\.[^.]*){3}$

The regex above matches a string that has exactly 3 dots: http://rubular.com/r/Tsaemvz1Yi

^(?:[^.]*\.[^.]*){4,}$

And this one matches a string that has 4 dots or more: http://rubular.com/r/IJDeQWVhEB
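
As a rough sketch of how the two patterns above could drive the "one thing or the other" decision, the following Python tries them in order. The function name classify and the returned labels are made up for illustration:

```python
import re

# The two patterns from this answer, unchanged.
EXACTLY_THREE = re.compile(r"^(?:[^.]*\.[^.]*){3}$")
FOUR_OR_MORE = re.compile(r"^(?:[^.]*\.[^.]*){4,}$")


def classify(s):
    """Pick a branch based on how many dots the string contains."""
    if EXACTLY_THREE.match(s):
        return "exactly 3 dots"
    if FOUR_OR_MORE.match(s):
        return "4 or more dots"
    return "fewer than 3 dots"


for s in ("aaa.bbb.ccc", "aaa.bbb.ccc.ddd", "aaa.bbb.ccc.ddd.eee"):
    print(s, "->", classify(s))
```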

zerkms answered Sep 19 '22 23:09


In Python (excuse me, but regexes know no language boundaries):

import re

# Raw string so that \. reaches the regex engine unchanged.
regx = re.compile(r'^([^.]*?\.){3}[^.]*?\.')

for ss in ("aaa.bbb.ccc",
           "aaa.bbb.ccc.ddd",
           'aaa.bbb.ccc.ddd.eee',
           'a.b.c.d.e.f.g.h.i...'):
    if regx.search(ss):
        print(ss + '     has at least 4 dots in it')
    else:
        print(ss + '     has a maximum of 3 dots in it')

result

aaa.bbb.ccc     has a maximum of 3 dots in it
aaa.bbb.ccc.ddd     has a maximum of 3 dots in it
aaa.bbb.ccc.ddd.eee     has at least 4 dots in it
a.b.c.d.e.f.g.h.i...     has at least 4 dots in it

This pattern does not need to scan the entire string (there is no $ in it): it can stop as soon as it has found the fourth dot, which makes it a better fit for long strings.
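
To make that point concrete, here is a small check, assuming the same pattern as above, that looks at where the match ends on a long input. The variable names long_string and consumed are illustrative only:

```python
import re

# Same pattern as in this answer: four dots, with no $ anchor at the end.
regx = re.compile(r'^([^.]*?\.){3}[^.]*?\.')

# Four dots up front, then a long tail the regex never needs to read.
long_string = 'a.b.c.d.' + 'x' * 100000

match = regx.search(long_string)
consumed = match.end()  # index just past the fourth dot

print(consumed, 'of', len(long_string), 'characters consumed')
```

The match ends right after the fourth dot, so the 100000-character tail is never part of it.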

eyquem answered Sep 19 '22 23:09