Regex, Remove duplicate paths from delimited string

Question

I'm trying to remove duplicate file paths from a semicolon-delimited string using a regular expression. The order of the resulting paths does not matter.

Example Input:

C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;

Desired Output:

C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;

I have the following regex, which works but is very slow when the input strings get long. Running it over thousands of lines makes the total time unacceptable.

\b([^;]+)(?=.*;\1;);

Any tips on how to improve its performance are much appreciated!

Patrick Artner · Accepted Answer

Or the C# version:

using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        var paths = @"C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;";

        var cleaned = string.Join(";", new HashSet<string>(paths.Split(';')));

        Console.WriteLine(cleaned);
    }
}

Outputs:

C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path3;

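Split the input at ;, put the pieces into a HashSet<string>(..) to get rid of dupes, then join with ; again. Because the input ends with ;, the split also yields one empty string, which shows up here as the trailing ; in the output.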

Caveat: if your paths contain ; as part of a directory name, this breaks. You would have to get more creative for that case, but the same holds for any regex you use.
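If you also want to keep the first-seen order explicitly, skip the empty entry produced by a trailing ;, or treat paths that differ only in case as duplicates, the same split/set/join idea can be extended. The following is only a sketch: the class and method names (PathDedup, RemoveDuplicates) and the optional comparer parameter are made up for illustration, and whether case-insensitive comparison is appropriate depends on your file system.

```csharp
using System;
using System.Collections.Generic;

public static class PathDedup
{
    // Hypothetical helper, not part of the original answer.
    // Keeps the first occurrence of each path, in input order.
    public static string RemoveDuplicates(string paths, StringComparer comparer = null)
    {
        // HashSet<T>.Add returns false when the item is already present.
        var seen = new HashSet<string>(comparer ?? StringComparer.Ordinal);
        var unique = new List<string>();

        // RemoveEmptyEntries drops the empty string caused by a trailing ';'.
        foreach (var part in paths.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries))
        {
            if (seen.Add(part))
            {
                unique.Add(part);
            }
        }

        return string.Join(";", unique);
    }

    public static void Main()
    {
        // Case-sensitive (default): C:\b and C:\B are different entries.
        Console.WriteLine(RemoveDuplicates(@"C:\a;C:\b;C:\a;C:\B;"));
        // prints C:\a;C:\b;C:\B

        // Case-insensitive: C:\B is treated as a duplicate of C:\b.
        Console.WriteLine(RemoveDuplicates(@"C:\a;C:\b;C:\a;C:\B;", StringComparer.OrdinalIgnoreCase));
        // prints C:\a;C:\b
    }
}
```

Note that this variant does not emit a trailing ;, so append one yourself if whatever consumes the string expects it.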

haukex · Answer

The typical way to remove duplicates in Perl is with a hash. See also perlfaq4: How can I remove duplicate elements from a list or array?

my $str = q{C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3};
my %seen;
my $out = join ';', sort grep { !$seen{$_}++ } split /;/, $str;
print $out, "\n";
__END__
# Output:
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6

I threw the sort in there, but you can remove it if you don't need it.

Although you haven't yet specified whether the implementation is supposed to be in C# or Perl, the same idea should apply to C# as well. (Update: see Patrick Artner's answer)

Note that the regex is slow because, for every match of \b([^;]+), the engine has to scan the entire rest of the string for the lookahead .*;\1;, so it is essentially a pair of nested loops.
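To make the nested-loop comparison concrete, here is a rough C# sketch. It is an analogy, not a model of how the regex engine actually works: KeepLastQuadratic imitates the lookahead by scanning everything after each entry for a later copy, and KeepLastLinear gets the same result in a single backwards pass with a HashSet. Both names, and the shortened path1-style entries, are made up for illustration. Like the regex, both keep the last occurrence of each entry.

```csharp
using System;
using System.Collections.Generic;

public static class LookaheadCost
{
    // Imitates the regex: drop an entry if an identical one appears later.
    // The inner loop plays the role of the lookahead, so the work grows
    // with the square of the number of entries in the worst case.
    public static List<string> KeepLastQuadratic(string[] parts)
    {
        var result = new List<string>();
        for (int i = 0; i < parts.Length; i++)
        {
            bool repeatedLater = false;
            for (int j = i + 1; j < parts.Length; j++)
            {
                if (parts[j] == parts[i])
                {
                    repeatedLater = true;
                    break;
                }
            }
            if (!repeatedLater)
            {
                result.Add(parts[i]);
            }
        }
        return result;
    }

    // Same result in one pass: walk backwards and remember what was seen.
    public static List<string> KeepLastLinear(string[] parts)
    {
        var seen = new HashSet<string>();
        var result = new List<string>();
        for (int i = parts.Length - 1; i >= 0; i--)
        {
            if (seen.Add(parts[i]))
            {
                result.Add(parts[i]);
            }
        }
        result.Reverse();
        return result;
    }

    public static void Main()
    {
        var parts = "path1;path5;path1;path6;path1;path3;path1;path3".Split(';');

        Console.WriteLine(string.Join(";", KeepLastQuadratic(parts)));
        // prints path5;path6;path1;path3

        Console.WriteLine(string.Join(";", KeepLastLinear(parts)));
        // prints path5;path6;path1;path3
    }
}
```

The order here matches the desired output in the question, since both versions keep the last copy of each path, just as the regex does.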

Tags:

c#

regex

perl

Troy Harter
