
Regex optional non-capturing groups

I am a total regex noob and have spent hours trying to solve this puzzle. I think I have to use some kind of optional non-capturing groups or alternation.

I want to match the following strings:

  1. Neuer Film a von 1000

  2. Neuer Film a von 1000 mit b

  3. Neuer Film a von 1000 mit b und c

  4. Neuer Film a von 1000 mit b und c und d

  5. Neuer Film a mit b

  6. Neuer Film a mit b und c

  7. Neuer Film a mit b und c und d

My regex looks like this:

var regex = /(?:[nN]euer [Ff]ilm\s?)(.*)(?:[vV]on).(\d{4}).(?:[Mm]it)(.*)(?:[uU]nd)(.*)/g;

The problem is that it matches only strings 3 and 4. In string 4 it also does not split at the first "und": everything up to the last "und" is packed into group 3 instead of group 4.

Can someone please help with my regex (which is not very user-friendly at all)? ;)
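The behaviour described above can be reproduced with a short script. This is a minimal sketch that runs the pattern from the question against the seven strings; the names `inputs` and `results` are illustrative and not part of the original post.

```javascript
// The pattern from the question, unchanged (including the g flag).
var regex = /(?:[nN]euer [Ff]ilm\s?)(.*)(?:[vV]on).(\d{4}).(?:[Mm]it)(.*)(?:[uU]nd)(.*)/g;

var inputs = [
  'Neuer Film a von 1000',
  'Neuer Film a von 1000 mit b',
  'Neuer Film a von 1000 mit b und c',
  'Neuer Film a von 1000 mit b und c und d',
  'Neuer Film a mit b',
  'Neuer Film a mit b und c',
  'Neuer Film a mit b und c und d'
];

var results = inputs.map(function (s) {
  // A g-flagged regex remembers lastIndex between exec() calls, so reset it.
  regex.lastIndex = 0;
  var m = regex.exec(s);
  return m ? m.slice(1) : null; // captured groups only, or null if no match
});

results.forEach(function (groups, i) {
  console.log((i + 1) + ': ' + JSON.stringify(groups));
});
// Strings 1, 2, 5, 6 and 7 print null because "von", "mit" and "und" are all mandatory.
// 3: ["a ","1000"," b "," c"]
// 4: ["a ","1000"," b und c "," d"]  (the greedy third group runs up to the last "und")
```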

TrantSteel asked Apr 11 '17 19:04



1 Answer

You do need optional non-capturing groups (like (?:...)?), but you also need anchors (^ to match the start of the string and $ to match its end) and lazy dot-matching patterns (.*?, to match as few characters as possible).
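Why the anchors and the lazy quantifier have to work together can be seen in a small experiment. This is only a sketch on a shortened pattern (the "von" and "und" parts are left out for brevity); the variable names `unanchored`, `anchored` and `greedy` are illustrative.

```javascript
var s = 'Neuer Film a mit b';

// Lazy group, optional tail, no $: the lazy group is satisfied with zero
// characters and the optional group is simply skipped.
var unanchored = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s+[Mm]it\s*(.*))?/.exec(s);
console.log(unanchored[1] === '', unanchored[2]); // true undefined

// Same pattern with $: the match must reach the end of the string, so the
// lazy group grows one character at a time until the optional tail fits.
var anchored = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s+[Mm]it\s*(.*))?$/.exec(s);
console.log(anchored[1], anchored[2]); // a b

// Greedy group with $: group 1 swallows the rest of the line and the
// optional tail never gets a chance to match.
var greedy = /^[nN]euer [Ff]ilm\s*(.*)(?:\s+[Mm]it\s*(.*))?$/.exec(s);
console.log(greedy[1], greedy[2]); // a mit b undefined
```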

You may use

/^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/

See the regex demo. In the demo, /gm modifiers are necessary since the input is a multiline string.
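The role of the /gm modifiers can be illustrated with a small sketch, assuming the input is one string with one title per line. The names `text`, `single`, `multi` and `found` are illustrative.

```javascript
var source = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/.source;

var text = [
  'Neuer Film a von 1000',
  'Neuer Film a mit b und c'
].join('\n');

// Without m, ^ and $ only match at the start and end of the whole input,
// and . cannot cross the line break, so nothing matches.
var single = new RegExp(source).exec(text);
console.log(single); // null

// With m, ^ and $ match at every line boundary; with g, exec() can be
// called repeatedly to walk through all matches.
var multi = new RegExp(source, 'gm');
var found = [];
var m;
while ((m = multi.exec(text)) !== null) {
  found.push(m.slice(1));
}
console.log(found.length); // 2
console.log(found[0][0], found[0][1]);              // a 1000
console.log(found[1][0], found[1][2], found[1][3]); // a b c
```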

Pattern details:

  • ^ - start of a string anchor
  • [nN]euer [Ff]ilm - Neuer Film / Neuer film / neuer Film / neuer film
  • \s* - zero or more whitespaces
  • (.*?) - Group 1: any 0+ chars other than line break chars, as few as possible (that is, up to the leftmost occurrence of the subsequent subpatterns)
  • (?:\s*[vV]on\s+(\d{4}))? - 1 or 0 occurrences of:
    • \s* - 0+ whitespaces
    • [vV]on - von or Von
    • \s+ - 1+ whitespaces
    • (\d{4}) - Group 2: 4 digits
  • (?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)? - an optional non-capturing group matching 1 or 0 occurrences of:
    • \s+ - 1+ whitespaces
    • [Mm]it - Mit or mit
    • \s* - 0+ whitespaces
    • (.*?) - Group 3 matching any 0+ chars other than line break chars, as few as possible
    • (?:\s*[uU]nd\s*(.*))? - an optional non-capturing group matching 1 or 0 occurrences of:
      • \s*[uU]nd\s* - und or Und enclosed with 0+ whitespaces
      • (.*) - Group 4 matching any 0+ chars other than line break chars, as many as possible
  • $ - end of string.
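The breakdown above can also be mirrored in code by assembling the same pattern from three pieces. This is a sketch that swaps the numbered groups for named capture groups (ES2018); the group names `title`, `year`, `first` and `rest` and the variables `head`, `von` and `mit` are made up here for illustration and do not appear in the original pattern.

```javascript
// Each piece corresponds to one top-level bullet of the pattern details.
var head = /^[nN]euer [Ff]ilm\s*(?<title>.*?)/;                         // Group 1
var von  = /(?:\s*[vV]on\s+(?<year>\d{4}))?/;                           // Group 2
var mit  = /(?:\s+[Mm]it\s*(?<first>.*?)(?:\s*[uU]nd\s*(?<rest>.*))?)?$/; // Groups 3 and 4

// Concatenating the sources yields the same pattern, only with named groups.
var named = new RegExp(head.source + von.source + mit.source);

var g = named.exec('Neuer Film a von 1000 mit b und c und d').groups;
console.log(g.title); // a
console.log(g.year);  // 1000
console.log(g.first); // b
console.log(g.rest);  // c und d
```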

// The seven test strings from the question
var strs = ['Neuer Film a von 1000','Neuer Film a von 1000 mit b','Neuer Film a von 1000 mit b und c','Neuer Film a von 1000 mit b und c und d','Neuer Film a mit b','Neuer Film a mit b und c','Neuer Film a mit b und c und d'];
// No g flag here, so every exec() call starts at the beginning of its string
var rx = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/;
for (var s of strs) {
   var m = rx.exec(s);
   if (m) {
     console.log('-- ' + s + ' ---');
     console.log('Group 1: ' + m[1]);
     // Groups 2 to 4 are undefined when their optional part did not participate in the match
     if (m[2]) console.log('Group 2: ' + m[2]);
     if (m[3]) console.log('Group 3: ' + m[3]);
     if (m[4]) console.log('Group 4: ' + m[4]);
   }
}
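Note that for a string with two "und"s, Group 4 holds everything after the first one (for example "c und d"). If each name is wanted separately, one possible follow-up step is to split Group 4 afterwards. The sketch below assumes that is the goal; `parseFilm` and the shape of the returned object are hypothetical and not part of the answer.

```javascript
var pattern = /^[nN]euer [Ff]ilm\s*(.*?)(?:\s*[vV]on\s+(\d{4}))?(?:\s+[Mm]it\s*(.*?)(?:\s*[uU]nd\s*(.*))?)?$/;

// Hypothetical helper: returns null for non-matching input, otherwise an
// object with the title, the optional year and a flat list of names.
function parseFilm(s) {
  var m = pattern.exec(s);
  if (!m) return null;
  var people = [];
  if (m[3]) people.push(m[3]);
  // Group 4 may still contain further "und"-separated names; split them here.
  if (m[4]) people = people.concat(m[4].split(/\s+[uU]nd\s+/));
  return { title: m[1], year: m[2] || null, people: people };
}

console.log(parseFilm('Neuer Film a von 1000 mit b und c und d'));
// { title: 'a', year: '1000', people: [ 'b', 'c', 'd' ] }
console.log(parseFilm('Neuer Film a'));
// { title: 'a', year: null, people: [] }
```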
Wiktor Stribiżew answered Oct 19 '22 00:10