
Regex.Matches returns one match per line, not per "word"

Tags: c#, .net, regex

I'm having a hard time understanding why the following expression \\[B.+\\] and code return a Matches count of 1:

string sRegEx = "\\[B.+\\]";
return Regex.Matches(Markup, sRegEx);

I want to find all the instances (let's call them 'tags') that are prefixed by B and enclosed in square brackets. They occur in a variable-length HTML string, Markup, that contains no line breaks.

If the markup contains [BName], I get one match - good.

If the markup contains [BName] [BAddress], I get one match - why?

If the markup contains [BName][BAddress], I also only get one match.

On some web-based regex testers, I've noticed that if the text contains a CR character, I'll get a match per line - but I need some way to specify that I want matches returned independent of line breaks.

I've also poked around in the Groups and Captures collections of the MatchCollection, but to no avail - always just one result.

James Rutledge asked Jan 19 '23 20:01


2 Answers

You are getting only one match because, by default, quantifiers such as + in .NET regular expressions are "greedy"; they try to match as much as possible within a single match.

So if your value is [BName][BAddress], you will get one match that covers the entire string: it runs from the [B at the beginning all the way to the last ], instead of stopping at the first one. If you want two matches, use this pattern instead: \\[B.+?\\]

The ? after the + tells the matching engine to match as little as possible, which leaves the second tag to be its own match.
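
As a rough illustration of the difference, here is a minimal sketch that runs both patterns over the same input. The class name, the FindAll helper and the sample markup are made up for this example; only Regex.Matches and the two patterns come from the discussion above.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class GreedyVsLazyDemo
{
    // Hypothetical helper: returns the text of every match of `pattern` in `input`.
    public static string[] FindAll(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
                    .Cast<Match>()
                    .Select(m => m.Value)
                    .ToArray();
    }

    public static void Main()
    {
        // Sample markup invented for the demo (no line breaks, as in the question).
        string markup = "<p>[BName] [BAddress]</p>";

        // Greedy: .+ runs to the last ']' in the string, so there is a single match.
        Console.WriteLine(string.Join(" | ", FindAll(markup, @"\[B.+\]")));
        // prints: [BName] [BAddress]

        // Lazy: .+? stops at the first ']' that completes the pattern, so each tag matches separately.
        Console.WriteLine(string.Join(" | ", FindAll(markup, @"\[B.+?\]")));
        // prints: [BName] | [BAddress]
    }
}
```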

SLaks also noted an excellent option: stating explicitly that the content must not include the closing ], like so: \\[B[^\\]]+\\] That keeps your match 'greedy', which might be useful in some other case. In this specific instance there may not be much difference, but it's an important thing to keep in mind depending on what data and patterns you are dealing with.
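
A small sketch of the negated-character-class variant, under the same assumptions as before (the class name, the CountTags helper and the inputs are invented for illustration):

```csharp
using System;
using System.Text.RegularExpressions;

public static class NegatedClassDemo
{
    // Hypothetical helper: counts [B...] tags. [^\]]+ is still greedy,
    // but it can never consume a ']', so a match cannot run past the tag's own bracket.
    public static int CountTags(string input)
    {
        return Regex.Matches(input, @"\[B[^\]]+\]").Count;
    }

    public static void Main()
    {
        Console.WriteLine(CountTags("[BName]"));             // 1
        Console.WriteLine(CountTags("[BName] [BAddress]"));  // 2
        Console.WriteLine(CountTags("[BName][BAddress]"));   // 2
    }
}
```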


On a side note, I recommend using the C# "verbatim string" specifier @ for regular expression patterns, so that you do not need to double-escape backslashes. I would set the pattern like so:

string pattern = @"\[B.+?\]";

This makes it much easier to read and write more complex regular expressions.
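
To show that the two spellings describe the same pattern, here is a short sketch (the class and constant names are made up for the example):

```csharp
using System;
using System.Text.RegularExpressions;

public static class VerbatimPatternDemo
{
    // Regular string literal: each backslash has to be escaped for the C# compiler.
    public const string Escaped = "\\[B.+?\\]";
    // Verbatim string literal: backslashes are passed through as written.
    public const string Verbatim = @"\[B.+?\]";

    public static void Main()
    {
        // Both literals produce the same characters, so the regex engine sees one pattern.
        Console.WriteLine(Escaped == Verbatim);                                 // True
        Console.WriteLine(Regex.Matches("[BName][BAddress]", Verbatim).Count);  // 2
    }
}
```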

Andrew Barber answered Jan 27 '23 20:01


Try the regex string \\[B.+?\\] instead. .+ on its own (the same is pretty much true for .*) will match as many characters as possible, whereas .+? (or .*?) will match the bare minimum number of characters whilst still satisfying the rest of the expression.
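
This probably also accounts for the one-match-per-line behaviour seen in the online testers: in .NET, . matches any character except \n unless RegexOptions.Singleline is set, so a greedy .+ cannot run past the end of a line. A minimal sketch, with a made-up class name, helper and two-line input:

```csharp
using System;
using System.Text.RegularExpressions;

public static class LineBreakDemo
{
    // Hypothetical helper: number of matches of `pattern` in `input` with the given options.
    public static int CountMatches(string input, string pattern,
                                   RegexOptions options = RegexOptions.None)
    {
        return Regex.Matches(input, pattern, options).Count;
    }

    public static void Main()
    {
        // Invented two-line input; the question's real markup has no line breaks.
        string twoLines = "[BName] [BAddress]\n[BCity] [BZip]";

        // Greedy, default options: '.' stops at '\n', so there is one match per line.
        Console.WriteLine(CountMatches(twoLines, @"\[B.+\]"));                           // 2
        // Greedy with Singleline: '.' also matches '\n', so everything collapses into one match.
        Console.WriteLine(CountMatches(twoLines, @"\[B.+\]", RegexOptions.Singleline));  // 1
        // Lazy: one match per tag, regardless of line breaks.
        Console.WriteLine(CountMatches(twoLines, @"\[B.+?\]"));                          // 4
    }
}
```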

Will A answered Jan 27 '23 20:01