Regex, Remove duplicate paths from delimited string

Tags: c#, regex, perl

I'm trying to remove duplicate file paths from a semicolon-delimited string using a regular expression. The order of the final paths does not matter.

Example Input:

C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;

Desired Output:

C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;

I have the following regex, which works but is very slow when the input strings get very long. Running it over thousands of lines makes the total time unacceptable.

\b([^;]+)(?=.*;\1;);

Any tips on how to improve the performance of this are much appreciated!

Troy Harter asked Feb 24 '18 09:02


2 Answers

Or the C# version of the same approach (deduplicating with a hash-based set instead of a regex):

using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        var paths = @"C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;";

        // Split on ';', let the HashSet drop duplicates, then re-join with ';'.
        var cleaned = string.Join(";", new HashSet<string>(paths.Split(';')));

        Console.WriteLine(cleaned);
    }
}

Outputs:

C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path3;

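Split the input at ;, put the pieces into a HashSet<string>(..) to get rid of dupes, then join them with ; again. HashSet<T> does not guarantee any particular enumeration order, which is fine here because the order of the paths does not matter.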
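
If the empty entry left by the trailing ; or differently cased duplicates are a concern, a variant along these lines may help. This is only a sketch: PathDedup.Dedupe is a made-up helper name, and treating paths that differ only in case as duplicates is an assumption about Windows paths that may not hold for your data. It keeps the first occurrence of each path in its original position.

```csharp
using System;
using System.Collections.Generic;

public static class PathDedup
{
    // Hypothetical helper, not part of the original answer.
    public static string Dedupe(string paths)
    {
        // RemoveEmptyEntries drops the empty element produced by a trailing ';'.
        var parts = paths.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);

        // Assumption: paths that differ only in case refer to the same directory.
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var unique = new List<string>();
        foreach (var part in parts)
        {
            // HashSet.Add returns false when an equal item is already present.
            if (seen.Add(part))
            {
                unique.Add(part);
            }
        }
        return string.Join(";", unique);
    }

    public static void Main()
    {
        Console.WriteLine(Dedupe(@"C:\a;C:\b;c:\A;C:\b;"));
        // prints: C:\a;C:\b
    }
}
```

Append a ";" to the result if the consumer expects the trailing delimiter.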


Caveat: If your paths contain ; as part of a directory name, this breaks. You would have to get more creative for that case, but the same limitation applies to any regex you use.

Patrick Artner answered Sep 25 '22 19:09


The typical way to remove duplicates in Perl is with a hash. See also perlfaq4: How can I remove duplicate elements from a list or array?

my $str = q{C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3};
my %seen;
my $out = join ';', sort grep { !$seen{$_}++ } split /;/, $str;
print $out, "\n";
__END__
# Output:
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6

I threw the sort in there, but you can remove it if you don't need sorted output; without it, the first occurrence of each path is kept in its original order.

Although you haven't yet specified whether the implementation is supposed to be in C# or Perl, the same idea should apply to C# as well. (Update: see Patrick Artner's answer)

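Note that the regex is slow because, for every match of \b([^;]+), the engine has to scan the entire rest of the string for the lookahead .*;\1;, so it's essentially like having nested loops: the work grows roughly quadratically with the length of the input, whereas the hash approach makes a single pass.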
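
To make the nested-loop comparison concrete, here is a rough C# sketch. It only approximates what the regex does (it works on whole entries rather than characters, and ignores the \b quirks), and the class and method names are invented for illustration. The regex drops an entry when the same entry appears again later, so the last occurrence of each path survives; the second method gets the same result in one pass with a HashSet.

```csharp
using System;
using System.Collections.Generic;

public static class LookaheadCost
{
    // Hypothetical illustration: for each entry, scan everything after it,
    // which is roughly the work the lookahead forces on the regex engine.
    public static List<string> KeepLastQuadratic(string[] parts)
    {
        var kept = new List<string>();
        for (int i = 0; i < parts.Length; i++)
        {
            bool appearsLater = false;
            for (int j = i + 1; j < parts.Length; j++) // plays the role of the lookahead
            {
                if (parts[j] == parts[i]) { appearsLater = true; break; }
            }
            if (!appearsLater) kept.Add(parts[i]);
        }
        return kept;
    }

    // Same result in a single pass: walk backwards and keep each first sighting.
    public static List<string> KeepLastLinear(string[] parts)
    {
        var seen = new HashSet<string>();
        var kept = new List<string>();
        for (int i = parts.Length - 1; i >= 0; i--)
        {
            if (seen.Add(parts[i])) kept.Add(parts[i]);
        }
        kept.Reverse(); // restore left-to-right order
        return kept;
    }

    public static void Main()
    {
        var parts = "p1;p5;p1;p6;p1;p3;p1;p3".Split(';');
        Console.WriteLine(string.Join(";", KeepLastQuadratic(parts))); // prints: p5;p6;p1;p3
        Console.WriteLine(string.Join(";", KeepLastLinear(parts)));    // prints: p5;p6;p1;p3
    }
}
```

The shortened names p1, p5, ... stand in for the full paths from the question; the surviving order matches the desired output shown there.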

haukex answered Sep 23 '22 19:09