
Why does this loop through Regex groups print the output twice?

Tags: c#, regex

I have written this very straightforward regex code:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace RegexTest1
{
    class Program
    {
        static void Main(string[] args)
        {
            string a = "\"foobar123==\"";
            Regex r = new Regex("^\"(.*)\"$");
            Match m = r.Match(a);
            if (m.Success)
            {
                foreach (Group g in m.Groups)
                {
                    Console.WriteLine(g.Index);
                    Console.WriteLine(g.Value);
                }
            }
        }
    }
}

However, the output is:

0
"foobar123=="
1
foobar123==

I don't understand why it prints twice. Why should there be a capture at index 0, when my regex starts with ^\" and I am not capturing that part?

Sorry if this is very basic, but I don't write regexes on a daily basis.

I expected this code to print only once, with index 1 and value foobar123==

Knows Not Much asked Sep 23 '14 19:09


2 Answers

This happens because group zero is special: it returns the entire match.

From the Regex documentation (emphasis added):

A simple regular expression pattern illustrates how numbered (unnamed) and named groups can be referenced either programmatically or by using regular expression language syntax. The regular expression ((?<One>abc)\d+)?(?<Two>xyz)(.*) produces the following capturing groups by number and by name. The first capturing group (number 0) always refers to the entire pattern.

#  Name              Group
-  ----------------  --------------------------------
0  0 (default name)  ((?<One>abc)\d+)?(?<Two>xyz)(.*)
1  1 (default name)  ((?<One>abc)\d+)
2  2 (default name)  (.*)
3  One               (?<One>abc)
4  Two               (?<Two>xyz)
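The numbering in the table can be checked programmatically. The following is a minimal sketch, not part of the quoted documentation; the class name GroupNumbering and the helper Describe are made up for illustration, while GetGroupNumbers and GroupNameFromNumber are existing Regex methods.

```csharp
using System;
using System.Text.RegularExpressions;

static class GroupNumbering
{
    // Hypothetical helper: returns "number:name" for every group
    // the pattern defines, in numeric order.
    public static string[] Describe(string pattern)
    {
        var regex = new Regex(pattern);
        int[] numbers = regex.GetGroupNumbers();
        var result = new string[numbers.Length];
        for (int i = 0; i < numbers.Length; i++)
        {
            result[i] = numbers[i] + ":" + regex.GroupNameFromNumber(numbers[i]);
        }
        return result;
    }

    public static void Main()
    {
        // Prints 0:0, 1:1, 2:2, 3:One, 4:Two (one per line):
        // unnamed groups are numbered first, named groups after them.
        foreach (string line in Describe(@"((?<One>abc)\d+)?(?<Two>xyz)(.*)"))
        {
            Console.WriteLine(line);
        }
    }
}
```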

If you do not want to see group 0, start the output from group 1.
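One way to do that is an index-based loop that begins at 1 instead of a foreach over all groups. This is only a sketch built around the input and pattern from the question; the class name CapturedGroupsOnly and the helper Collect are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CapturedGroupsOnly
{
    // Hypothetical helper: collects "index:value" for groups 1..n,
    // leaving out group 0 (the whole match).
    public static List<string> Collect(string input, string pattern)
    {
        var lines = new List<string>();
        Match m = Regex.Match(input, pattern);
        if (m.Success)
        {
            // Start at 1: m.Groups[0] is always the entire match.
            for (int i = 1; i < m.Groups.Count; i++)
            {
                Group g = m.Groups[i];
                lines.Add(g.Index + ":" + g.Value);
            }
        }
        return lines;
    }

    public static void Main()
    {
        foreach (string line in Collect("\"foobar123==\"", "^\"(.*)\"$"))
        {
            Console.WriteLine(line); // 1:foobar123==
        }
    }
}
```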

Sergey Kalinichenko answered Oct 17 '22 22:10


A regex match contains several groups at once. Group 0 is the entire matched region (including the quotation marks). Group 1 is the group defined by the parentheses.

Say your regex has the following form:

A(B(C)D)E

where A, B, C, D and E are regular expressions.

Then the following groups will be matched:

0 A(B(C)D)E
1 B(C)D
2 C

The i-th (unnamed) group starts at the i-th opening parenthesis, counting from the left. You can think of a "zero-th" opening parenthesis as being placed implicitly at the beginning of the regex, with its closing parenthesis at the end.
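To make the numbering concrete, here is a small sketch that takes the literals a, b, c, d and e for A, B, C, D and E. The class name NestedGroups and the helper Values are hypothetical names chosen for this example.

```csharp
using System;
using System.Text.RegularExpressions;

static class NestedGroups
{
    // Hypothetical helper: returns the value of every group of the match,
    // indexed by group number; an empty array if the input does not match.
    public static string[] Values(string input, string pattern)
    {
        Match m = Regex.Match(input, pattern);
        if (!m.Success)
        {
            return new string[0];
        }
        var values = new string[m.Groups.Count];
        for (int i = 0; i < m.Groups.Count; i++)
        {
            values[i] = m.Groups[i].Value;
        }
        return values;
    }

    public static void Main()
    {
        string[] values = Values("abcde", "a(b(c)d)e");
        for (int i = 0; i < values.Length; i++)
        {
            // Prints: 0 abcde, then 1 bcd, then 2 c
            Console.WriteLine(i + " " + values[i]);
        }
    }
}
```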

If you want to omit group 0, you can use the Skip method from LINQ. Call Cast<Group>() first: in the .NET Framework, GroupCollection implements only the non-generic IEnumerable, so Skip is not available on it directly, and the cast also works on newer runtimes.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

namespace RegexTest1 {
    class Program {
        static void Main(string[] args) {
            string a = "\"foobar123==\"";
            Regex r = new Regex("^\"(.*)\"$");
            Match m = r.Match(a);
            if (m.Success) {
                // Cast to IEnumerable<Group>, then skip the first element (group 0)
                foreach (Group g in m.Groups.Cast<Group>().Skip(1)) {
                    Console.WriteLine(g.Index);
                    Console.WriteLine(g.Value);
                }
            }
        }
    }
}

Willem Van Onsem answered Oct 17 '22 22:10