
Get the capturing group of all iterations

Tags: c#, regex

I am working on C# Regex.

Input text:

headera
aa1aaa
aa2aaa
aa3aaa

headerb
aa4aaa
aa5aaa
aa6aaa

headerc
aa7aaa
aa8aaa
aa9aaa

I would like to capture only the numbers 4, 5, and 6, which are between headerb and headerc.

My attempts:

I was able to capture the numbers under headera and headerb with the pattern below. I cannot apply the same concept with a lookbehind, since as far as I know a lookbehind must be fixed-width, so quantifiers are not allowed inside it.

aa(\d+)aaa(?=[\s|\S]*headerc)

Repeating the capturing group only captures the last iteration, and I cannot use a wildcard pattern to cover the multiple instances.

Please assist. Thanks

[SOLVED] .NET supports variable-width lookbehind, so any of the patterns below can be used:

@"(?<=headerb[\s\S]*)aa(\d+)aaa(?=[\s\S]*headerc)"
@"(?s)(?<=\bheaderb\b.*?)\d+(?=.*?\bheaderc\b)"
@"(?<=\bheaderb\b(?:(?!\bheaderc\b)[\s\S])*)aa(\d+)aaa"
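On the point about a repeated capturing group keeping only its last iteration: in .NET, Group.Value does hold only the last iteration, but Group.Captures records every one. The following is a minimal sketch of that alternative, not one of the patterns above. The class name, the method name, and the pattern are my own assumptions, and the pattern assumes the block contains nothing but aaNaaa lines separated by whitespace.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class RepeatedGroupCaptures
{
    // Hypothetical helper: returns every value captured by group 1 across
    // all iterations of the repeated group between headerb and headerc.
    public static string[] NumbersInHeaderB(string input)
    {
        // The non-capturing group repeats once per "aaNaaa" line. Group 1's
        // Value is overwritten on each iteration, but .NET keeps each
        // individual capture in Groups[1].Captures.
        var match = Regex.Match(input, @"\bheaderb\s+(?:aa(\d+)aaa\s+)+headerc\b");
        if (!match.Success)
        {
            return new string[0];
        }

        return match.Groups[1].Captures
                    .Cast<Capture>()
                    .Select(c => c.Value)
                    .ToArray();
    }
}
```

With the input text from the question, NumbersInHeaderB should return 4, 5 and 6, while match.Groups[1].Value alone would be 6.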
Draco Sahin asked Feb 09 '23 10:02



2 Answers

C# supports variable-width lookbehind, so use it.

(?<=\bheaderb\b(?:(?!\bheaderc\b)[\s\S])*)aa(\d+)aaa
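A minimal sketch of how this pattern might be applied from C#. The class and method names are assumptions made for illustration; only the pattern itself comes from the answer.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class LookbehindCapture
{
    // Hypothetical helper: collects group 1 of every match of the pattern above.
    public static List<string> NumbersAfterHeaderB(string input)
    {
        // The lookbehind requires "headerb" somewhere before the match, with
        // no "headerc" starting anywhere between that header and the match.
        const string pattern = @"(?<=\bheaderb\b(?:(?!\bheaderc\b)[\s\S])*)aa(\d+)aaa";

        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }
}
```

Each match is one aaNaaa line, so the numbers come from Groups[1] of every match rather than from a single repeated group. On the input from the question this should give 4, 5 and 6.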

See Demo.

vks answered Feb 10 '23 23:02



Your regex does not match what you need because it does not include the boundaries. Note that aa(\d+)aaa(?=[\s|\S]*headerc) matches aa, one or more digits, and aaa, provided that they are followed by zero or more of any character and then headerc ([\s|\S] behaves the same as [\s\S]; the | inside a character class is a literal pipe and is redundant here). Thus, you do not have a leading boundary.
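To make the missing leading boundary concrete, here is a small sketch that runs the pattern from the question unchanged. The class and method names are assumptions for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OriginalPatternDemo
{
    // Hypothetical helper: runs the question's pattern and returns group 1
    // of every match.
    public static List<string> Captured(string input)
    {
        // Only the trailing side is constrained (headerc must follow);
        // nothing requires headerb to appear before the match.
        const string pattern = @"aa(\d+)aaa(?=[\s|\S]*headerc)";

        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToList();
    }
}
```

On the input from the question this should return 1 through 6, because every number located before headerc satisfies the lookahead, including those under headera.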

If you insist on a regex, you can make use of a variable-width lookbehind in .NET regex:

(?s)(?<=\bheaderb\b(?>(?!\bheader[bc]\b).)*)\d+

See demo. The (?<=\bheaderb\b(?>(?!\bheader[bc]\b).)*) lookbehind requires the whole word headerb, followed by zero or more characters that do not start another headerb or headerc, immediately before a sequence of digits (note the Singleline modifier, (?s), which I added so that . also matches a newline). The (?>(?!\bheader[bc]\b).)* part is a tempered greedy token: it matches any substring that does not contain headerb or headerc as whole words. It is what stops the match at headerc, and it is necessary in case there is another headerb...headerc block later in the input, after a headerc...headerd block (see my regex demo).
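The effect of the tempered greedy token is easiest to see on input that contains a second headerb...headerc block. The sketch below compares the pattern above with a lazy-dot pattern that has no such guard. The class name, the method names, and the extended sample input in the note are assumptions made for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class TemperedTokenDemo
{
    // Lookbehind with a tempered greedy token: the text between headerb and
    // the digits may not contain another headerb or headerc.
    public static List<string> Tempered(string input)
    {
        return Values(input, @"(?s)(?<=\bheaderb\b(?>(?!\bheader[bc]\b).)*)\d+");
    }

    // Lazy-dot variant for comparison: it only requires some headerb before
    // and some headerc after the digits, anywhere in the input.
    public static List<string> LazyDot(string input)
    {
        return Values(input, @"(?s)(?<=\bheaderb\b.*?)\d+(?=.*?\bheaderc\b)");
    }

    // Returns the full text of every match.
    private static List<string> Values(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToList();
    }
}
```

Take an input made of four blocks followed by a final headerc section: headerb (4, 5, 6), headerc (7, 8, 9), headerd (10), a second headerb (11), and a last headerc (12). Tempered should return 4, 5, 6 and 11, whereas LazyDot should return everything from 4 to 11, because each of those numbers has a headerb somewhere before it and a headerc somewhere after it.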

However, the regex solution is not efficient (though potentially a "quick and dirty" one-time solution). You can also use this approach: split the input on newline characters into an array of lines, find the block you need with LINQ, and then apply a simple regex to find all digit sequences:

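// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
var lines = s.Split(new[] { "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries); // Split into a line array
var subset = lines.SkipWhile(p => p != "headerb") // Get to the "headerb" line
                  .Skip(1)    // Move to the line after "headerb"
                  .TakeWhile(m => m != "headerc")  // Grab the lines in the block we need
                  .ToList();
// Join with "\n" rather than an empty string so that digits at the end of
// one line and the start of the next are not merged into a single number
var digits = Regex.Matches(string.Join("\n", subset), "[0-9]+")
                 .Cast<Match>()
                 .Select(v => v.Value)
                 .ToList();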


Wiktor Stribiżew answered Feb 11 '23 00:02
