I'm trying to remove duplicate file paths from a semicolon-delimited string using a regular expression. The order of the final paths does not matter.
Example Input:
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;
Desired Output:
C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;
I have the following regex, which works but is very slow when the input strings get very long. Running it over thousands of lines makes the total time unacceptable.
\b([^;]+)(?=.*;\1;);
Any tips on how to improve the performance of this would be much appreciated!
Or the C# version:
using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {
        var paths = @"C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;";

        // Split at ';', let the HashSet drop duplicates, then join with ';' again.
        var cleaned = string.Join(";", new HashSet<string>(paths.Split(';')));
        Console.WriteLine(cleaned);
    }
}
Outputs:
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path3;
Split the input at each semicolon, put the pieces into a HashSet<string>(..) to get rid of dupes, then join them with ; again.
Caveat: if your paths contain ; as part of a directory name, this breaks. You would have to get more creative for that case, but the same would be true of any regex you use.
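The split, de-duplicate, join steps can also be sketched outside C#. The following is a minimal illustration in Python, not part of the original answer. The function name `dedupe_paths` and the short sample paths are hypothetical. Unlike the one-liner above, it drops the empty entry produced by a trailing semicolon and keeps first-occurrence order, which an unordered hash set does not promise.

```python
def dedupe_paths(paths: str, sep: str = ";") -> str:
    """Remove duplicate entries from a separator-delimited string.

    Hypothetical helper for illustration only.
    """
    # Split on the separator and skip empty pieces (e.g. from a trailing ';').
    parts = [p for p in paths.split(sep) if p]
    # dict keys are unique and keep insertion order (Python 3.7+),
    # so this removes duplicates while preserving first-seen order.
    unique = list(dict.fromkeys(parts))
    return sep.join(unique)


# Shortened, made-up sample input in the same shape as the question's data.
sample = r"C:\a;C:\b;C:\a;C:\c;C:\a;"
print(dedupe_paths(sample))
```

Each path is looked up in a hash table once, so the work grows roughly linearly with the number of paths instead of rescanning the string for every entry.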
The typical way to remove duplicates in Perl is with a hash. See also perlfaq4: How can I remove duplicate elements from a list or array?
my $str = q{C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path6;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3};
my %seen;
my $out = join ';', sort grep { !$seen{$_}++ } split /;/, $str;
print $out, "\n";
__END__
# Output:
C:\Users\user\Desktop\TESTING\path1;C:\Users\user\Desktop\TESTING\path3;C:\Users\user\Desktop\TESTING\path5;C:\Users\user\Desktop\TESTING\path6
I threw the sort in there, but you can remove it if you don't need that.
Although you haven't yet specified whether the implementation is supposed to be in C# or Perl, the same idea should apply to C# as well. (Update: see Patrick Artner's answer)
Note that the regex is slow because, for every match of \b([^;]+), the engine has to scan the entire rest of the string for the lookahead .*;\1;, so it's essentially like having nested loops.
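To make the nested-loop comparison concrete, here is a rough sketch in Python (chosen only as a neutral illustration language; none of these function names come from the answers). It runs the question's regex unchanged, spells out the same work as two explicit loops, and contrasts both with a single pass that uses a hash set. The short sample paths are made up, and the equivalence is only claimed for inputs shaped like the question's example, where every entry ends with a semicolon.

```python
import re

# The regex from the question, unchanged.
SLOW_PATTERN = re.compile(r"\b([^;]+)(?=.*;\1;);")


def dedupe_regex(paths: str) -> str:
    # Deletes every entry that appears again later in the string.
    return SLOW_PATTERN.sub("", paths)


def dedupe_nested(paths: str) -> str:
    # The same idea as explicit nested loops: for each entry (outer loop),
    # scan everything to its right (inner loop) for a later duplicate.
    parts = [p for p in paths.split(";") if p]
    kept = []
    for i, part in enumerate(parts):
        if part not in parts[i + 1:]:
            kept.append(part)
    return "".join(p + ";" for p in kept)


def dedupe_hashed(paths: str) -> str:
    # Single pass: each entry costs one hash lookup instead of a rescan.
    seen = set()
    kept = []
    for part in paths.split(";"):
        if part and part not in seen:
            seen.add(part)
            kept.append(part)
    return "".join(p + ";" for p in kept)


# Shortened, made-up sample input in the same shape as the question's data.
sample = r"C:\a;C:\b;C:\a;C:\c;C:\a;"
print(dedupe_regex(sample))
print(dedupe_nested(sample))
print(dedupe_hashed(sample))
```

The regex and the nested loops keep the last occurrence of each path, while the hash version keeps the first, so the order differs but the set of surviving paths is the same. Since the question says order does not matter, the single-pass version is the cheaper choice.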