
How can I get a regex match to only be added once to the matches collection?

Tags: c#, regex

I have a string that contains several HTML comments, and I need to count the unique matches of an expression.

For example, the string might be:

var teststring = "<!--X1-->Hi<!--X1-->there<!--X2-->";

I currently use this to get the matches:

var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);

This returns 3 matches. However, I would like it to return only 2, since there are only two unique matches.

I know I can probably loop through the resulting MatchCollection and remove the extra Match, but I'm hoping there is a more elegant solution.

Clarification: The sample string is greatly simplified from what is actually being used. There can easily be an X8 or X9, and there are likely dozens of each in the string.

asked Mar 20 '09 13:03 by Sailing Judo



2 Answers

I would just use the Enumerable.Distinct method, for example like this:

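// Requires: using System; using System.Linq; using System.Text.RegularExpressions;
string subjectString = "<!--X1-->Hi<!--X1-->there<!--X2--><!--X1-->Hi<!--X1-->there<!--X2-->";
var regex = new Regex(@"<!--X\d-->");
var matches = regex.Matches(subjectString);
var uniqueMatches = matches
    .OfType<Match>()        // turn the MatchCollection into an IEnumerable<Match>
    .Select(m => m.Value)   // compare by matched text rather than by Match object
    .Distinct();

// Print each distinct matched string on its own line.
foreach (var value in uniqueMatches)
    Console.WriteLine(value);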

Outputs this:

<!--X1-->
<!--X2-->
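
Since the question asks for a count rather than a list, the same idea can be wrapped in a small helper. This is only a sketch: the class name UniqueComments and the methods Find and CountUnique are hypothetical names made up for illustration, and the pattern assumes a single digit after the X, as in the sample string.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, not part of the original answer.
public static class UniqueComments
{
    // Returns each distinct matched string once.
    public static List<string> Find(string input)
    {
        return Regex.Matches(input, @"<!--X\d-->")
            .Cast<Match>()
            .Select(m => m.Value)   // compare by matched text
            .Distinct()
            .ToList();
    }

    // Returns how many different comments were matched.
    public static int CountUnique(string input)
    {
        return Find(input).Count;
    }
}
```

Calling UniqueComments.CountUnique(teststring) on the string from the question should give 2, no matter how many times each comment is repeated.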

If you want a solution that uses only a regular expression, you could maybe use this one:

(<!--X\d-->)(?!.*\1.*)

Seems to work on your test string in RegexBuddy at least =)

// (<!--X\d-->)(?!.*\1.*)
// 
// Options: dot matches newline
// 
// Match the regular expression below and capture its match into backreference number 1 «(<!--X\d-->)»
//    Match the characters “<!--X” literally «<!--X»
//    Match a single digit 0..9 «\d»
//    Match the characters “-->” literally «-->»
// Assert that it is impossible to match the regex below starting at this position (negative lookahead) «(?!.*\1.*)»
//    Match any single character «.*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
//    Match the same text as most recently matched by capturing group number 1 «\1»
//    Match any single character «.*»
//       Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
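
For what it's worth, here is a rough sketch of how that lookahead pattern could be run from C#. The names LookaheadDemo and LastOccurrences are hypothetical; RegexOptions.Singleline is the .NET equivalent of the "dot matches newline" option listed above, and the trailing .* inside the lookahead is dropped because it does not change what matches. Note that the lookahead rejects a comment whenever the same comment appears again later, so what survives is the last occurrence of each one, not the first.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical demo class for the lookahead approach.
public static class LookaheadDemo
{
    // Returns the matched text of every comment that is not repeated later in the input.
    public static List<string> LastOccurrences(string input)
    {
        // Singleline makes '.' match newlines too, so the lookahead scans the whole rest of the string.
        var regex = new Regex(@"(<!--X\d-->)(?!.*\1)", RegexOptions.Singleline);
        return regex.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value)
            .ToList();
    }
}
```

Because every candidate match rescans the rest of the string, this is likely to be slower than Distinct on long inputs with dozens of repeated comments, so treat it as a curiosity more than a recommendation.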

answered Oct 20 '22 18:10 by Svish


It appears you're doing two different things:

  1. Matching comments like /<!--X.-->/
  2. Finding the set of unique comments

So it is fairly logical to handle these as two different steps:

var regex = new Regex("<!--X.-->");
var matches = regex.Matches(teststring);

var uniqueMatches = matches.Cast<Match>().Distinct(new MatchComparer());

class MatchComparer : IEqualityComparer<Match>
{
    public bool Equals(Match a, Match b)
    {
        return a.Value == b.Value;
    }

    public int GetHashCode(Match match)
    {
        return match.Value.GetHashCode();
    }
}
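
If writing a comparer class feels like too much ceremony, a similar result can probably be had with GroupBy. The sketch below is an alternative to the comparer above, not a replacement for it; MatchGrouping and FirstOfEach are hypothetical names. It keeps the first Match object for each distinct value, so properties such as Index are still available afterwards.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class showing a GroupBy-based alternative.
public static class MatchGrouping
{
    // Returns the first Match for each distinct matched text.
    public static List<Match> FirstOfEach(string input, string pattern)
    {
        return Regex.Matches(input, pattern)
            .Cast<Match>()
            .GroupBy(m => m.Value)      // one group per distinct matched text
            .Select(g => g.First())     // keep the earliest Match in each group
            .ToList();
    }
}
```

With the string from the question, MatchGrouping.FirstOfEach(teststring, "<!--X.-->").Count should be 2. Replacing g.First() with g.Count() would give the number of occurrences of each comment instead.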

answered Oct 20 '22 19:10 by user7116