I'm having a hard time understanding why the following expression \\[B.+\\]
and code returns a Matches count of 1:
string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);
I want to find all the instances (let's call them 'tags') that are prefixed by B and enclosed in square brackets. They appear in a variable-length HTML string, Markup, that contains no line breaks.
If the markup contains [BName], I get one match - good.
If the markup contains [BName] [BAddress], I get one match - why?
If the markup contains [BName][BAddress], I also only get one match.
On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.
I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.
You are getting only one match because, by default, quantifiers such as + in .NET regular expressions are "greedy"; they try to match as much as possible within a single match.
So if your value is [BName][BAddress], you will have one match, and it covers the entire string: it runs from the [B at the beginning all the way to the last ] instead of stopping at the first one. If you want two matches, use this pattern instead: \\[B.+?\\]
The ? after the + tells the matching engine to match as little as possible, leaving the second tag to be its own match.
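As a rough illustration of the difference, here is a minimal sketch that counts matches for the greedy and the lazy pattern on the sample input from the question. The class name GreedyVsLazyDemo and the helper CountMatches are made up for this example; only Regex.Matches comes from the original code.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns how many times the pattern matches the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    public static void Main()
    {
        string markup = "[BName] [BAddress]"; // sample input from the question

        // Greedy: .+ runs to the last ']' in the string, so everything is one match.
        Console.WriteLine(CountMatches(markup, "\\[B.+\\]"));   // 1

        // Lazy: .+? stops at the first ']' it can, giving one match per tag.
        Console.WriteLine(CountMatches(markup, "\\[B.+?\\]"));  // 2
    }
}
```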
Slaks also noted an excellent option: stating explicitly that you do not wish to match the closing ] as part of the content, like so: \\[B[^\\]]+\\]
That keeps your match 'greedy', which might be useful in some other case. In this specific instance, there may not be much difference - but it's an important thing to keep in mind, depending on the data and patterns you are dealing with.
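To make the "depending on the data" point concrete, here is a small sketch comparing the negated-character-class pattern with the lazy pattern \\[B.+?\\]. The input "[B][BName]" is an invented edge case, not something from the question, and NegatedClassDemo and FindTags are hypothetical names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Hypothetical helper: returns the text of every match, in order.
    public static string[] FindTags(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        string negated = "\\[B[^\\]]+\\]"; // content may not contain ']'
        string lazy = "\\[B.+?\\]";        // content is as short as possible

        // On the question's input, the negated class finds each tag separately.
        Console.WriteLine(string.Join(" | ", FindTags("[BName][BAddress]", negated)));
        // [BName] | [BAddress]

        // Invented edge case: an empty "[B]" followed by a real tag.
        // The lazy pattern needs at least one character, so it consumes the first ']'.
        Console.WriteLine(string.Join(" | ", FindTags("[B][BName]", lazy)));    // [B][BName]
        // The negated class can never cross a ']', so it skips "[B]" entirely.
        Console.WriteLine(string.Join(" | ", FindTags("[B][BName]", negated))); // [BName]
    }
}
```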
On a side note, I recommend using the C# verbatim string prefix @ for regular expression patterns, so that you do not need to double-escape backslashes. I would set the pattern like so:
string pattern = @"\[B.+?\]";
This makes more complex regular expressions much easier to read and write.
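A quick way to convince yourself that the two spellings are the same pattern is to compare the strings directly. This is only a sketch; the class name VerbatimPatternDemo and the constant names are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimPatternDemo
{
    // Regular string literal: each regex backslash must be doubled.
    public const string Escaped = "\\[B.+?\\]";

    // Verbatim string literal: backslashes are taken as written.
    public const string Verbatim = @"\[B.+?\]";

    public static void Main()
    {
        // Both literals produce the same characters, so the regex engine sees one pattern.
        Console.WriteLine(Escaped == Verbatim); // True

        Console.WriteLine(Regex.Matches("[BName][BAddress]", Verbatim).Count); // 2
    }
}
```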
Try the regex string \\[B.+?\\] instead. .+ on its own (the same is pretty much true for .*) will match as many characters as possible, whereas .+? (or .*?) will match the bare minimum number of characters whilst still satisfying the rest of the expression.