I am working on C# Regex.
Input text:
headera
aa1aaa
aa2aaa
aa3aaa
headerb
aa4aaa
aa5aaa
aa6aaa
headerc
aa7aaa
aa8aaa
aa9aaa
I would like to capture only the numbers 4, 5, and 6, which are between headerb and headerc.
My attempts:
I was able to capture those under headera and headerb with the pattern below. I cannot apply the same concept to a lookbehind, since it should be zero-width and thus quantifiers are not allowed.
aa(\d+)aaa(?=[\s|\S]*headerc)
Repeating the capturing group will only capture the last iteration. I cannot apply some wild card regex for the multiple instances.
Please assist. Thanks
[SOLVED] Taking advantage of .NET's support for variable-width lookbehind, you may use any of the patterns below:
@"(?<=headerb[\s|\S]*)aa(\d)aaa(?=[\s\S]*headerc)"
@"(?s)(?<=\bheaderb\b.*?)\d+(?=.*?\bheaderc\b)"
@"(?<=\bheaderb\b(?:(?!\bheaderc\b)[\s\S])*)aa(\d+)aaa"
C# supports variable-width lookbehind, so use it.
(?<=\bheaderb\b(?:(?!\bheaderc\b)[\s\S])*)aa(\d+)aaa
See Demo.
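As a rough sketch of how the pattern above might be used from C#, the snippet below runs it over the sample input from the question and reads capture group 1 of each match. The class and method names (HeaderBlockDemo, NumbersAfterHeaderB) are made up for illustration and are not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class HeaderBlockDemo
{
    // Hypothetical helper: returns the digits captured by group 1 for each match.
    public static string[] NumbersAfterHeaderB(string input)
    {
        // Lookbehind: "headerb" must occur earlier, with no "headerc" in between.
        const string pattern = @"(?<=\bheaderb\b(?:(?!\bheaderc\b)[\s\S])*)aa(\d+)aaa";
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Sample input taken from the question.
        string input = string.Join("\n", new[]
        {
            "headera", "aa1aaa", "aa2aaa", "aa3aaa",
            "headerb", "aa4aaa", "aa5aaa", "aa6aaa",
            "headerc", "aa7aaa", "aa8aaa", "aa9aaa"
        });

        Console.WriteLine(string.Join(", ", NumbersAfterHeaderB(input))); // 4, 5, 6
    }
}
```

Because the digits sit in a capture group, they are read from m.Groups[1] rather than m.Value, which would return the whole aa4aaa match.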
Your regex does not match what you need because it does not include both boundaries. Note that aa(\d+)aaa(?=[\s|\S]*headerc) matches aa, followed by 1 or more digits, followed by aaa, as long as this is followed by any characters ([\s\S] is the same as [\s|\S], the | is redundant), 0 or more occurrences, followed by headerc. Thus, you do not have a leading boundary.
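To make the missing leading boundary concrete, here is a small sketch that runs the question's original pattern over the sample input. The class and method names (OriginalPatternDemo, CaptureWithLookaheadOnly) are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class OriginalPatternDemo
{
    // Hypothetical helper: applies the question's pattern, which only has a trailing boundary.
    public static string[] CaptureWithLookaheadOnly(string input)
    {
        const string pattern = @"aa(\d+)aaa(?=[\s|\S]*headerc)";
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Groups[1].Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Sample input taken from the question.
        string input = string.Join("\n", new[]
        {
            "headera", "aa1aaa", "aa2aaa", "aa3aaa",
            "headerb", "aa4aaa", "aa5aaa", "aa6aaa",
            "headerc", "aa7aaa", "aa8aaa", "aa9aaa"
        });

        // Everything before headerc is captured, including the headera block.
        Console.WriteLine(string.Join(", ", CaptureWithLookaheadOnly(input))); // 1, 2, 3, 4, 5, 6
    }
}
```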
If you insist on a regex, you can make use of a variable-width lookbehind in .NET regex:
(?s)(?<=\bheaderb\b(?>(?!\bheader[bc]\b).)*)\d+
See demo. The (?<=\bheaderb\b(?>(?!\bheader[bc]\b).)*) lookbehind makes sure that a sequence of digits is preceded by the whole word headerb and then 0 or more characters that contain neither headerb nor headerc as whole words (note the Singleline modifier, (?s), which I added to force the . to match a newline). The (?>(?!\bheader[bc]\b).)* part is a tempered greedy token that matches any substring that does not contain either headerc or headerb as whole words. It is necessary in case there is another headerb....headerc block after headerc...headerd (see my regex demo).
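The effect of the tempered greedy token can be sketched with an input that contains two headerb blocks. The input below is invented for illustration (it is not the question's sample), and the names TemperedTokenDemo and DigitsInHeaderBBlocks are hypothetical.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class TemperedTokenDemo
{
    // Hypothetical helper: digits that follow headerb with no headerb/headerc in between.
    public static string[] DigitsInHeaderBBlocks(string input)
    {
        const string pattern = @"(?s)(?<=\bheaderb\b(?>(?!\bheader[bc]\b).)*)\d+";
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Made-up input: a headerb block, then headerc and headerd blocks, then a second headerb block.
        string input = string.Join("\n", new[]
        {
            "headerb", "aa4aaa",
            "headerc", "aa7aaa",
            "headerd", "aa8aaa",
            "headerb", "aa9aaa",
            "headerc", "aa0aaa"
        });

        // Only digits directly inside a headerb block are returned.
        Console.WriteLine(string.Join(", ", DigitsInHeaderBBlocks(input))); // 4, 9
    }
}
```

Here 7 and 0 are rejected because a headerc stands between them and the nearest preceding headerb, and 8 is rejected for the same reason even though a headerd line comes first.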
However, the regex solution is not efficient (though potentially a "quick and dirty" one-time solution). You can also use this approach: split the input on newline characters into a list of lines, find the block you need with LINQ, and then apply a simple regex to find all digit sequences:
using System;
using System.Linq;
using System.Text.RegularExpressions;

// s is assumed to hold the input text
var lines = s.Split(new[] { "\r", "\n" }, StringSplitOptions.RemoveEmptyEntries); // Split into a line array
var subset = lines.SkipWhile(p => p != "headerb") // Get to the "headerb" line
    .Skip(1) // Get to the line after "headerb"
    .TakeWhile(m => m != "headerc") // Grab the lines in the block we need
    .ToList();
// Join with a newline so that digits on adjacent lines are not merged into one number
var digits = Regex.Matches(string.Join("\n", subset), "[0-9]+")
    .Cast<Match>()
    .Select(v => v.Value)
    .ToList();