 

What's the most efficient way to check for duplicates in an array of data using Perl?


I need to see if there are duplicates in an array of strings. What's the most time-efficient way of doing it?

asked Jun 10 '10 by teukkam


2 Answers

One of the things I love about Perl is its ability to almost read like English. It just sort of makes sense.

use strict;
use warnings;

my @array = qw/yes no maybe true false false perhaps no/;

my %seen;
foreach my $string (@array) {
    # %seen counts occurrences; the post-increment is 0 (false)
    # the first time a string is seen, so only repeats get past this
    next unless $seen{$string}++;
    print "'$string' is duplicated.\n";
}

Output

'false' is duplicated.

'no' is duplicated.
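If you would rather collect the duplicates in a list than print them as they are found, the same %seen trick works inside grep. This is a sketch of my own rather than part of the answer above, and the @dups name is made up:

use strict;
use warnings;

my @array = qw/yes no maybe true false false perhaps no/;

my %seen;
# Keep a string exactly the second time it appears, so each
# duplicated value ends up in @dups once
my @dups = grep { $seen{$_}++ == 1 } @array;

print "'$_' is duplicated.\n" for @dups;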

answered Oct 07 '22 by Zaid


Turning the array into a hash is the fastest way [O(n)], though it's memory-inefficient. Using a for loop is a bit faster than grep, but I'm not sure why.

#!/usr/bin/perl

use strict;
use warnings;

# Sample data, borrowed from the first answer
my @array = qw/yes no maybe true false false perhaps no/;

my %count;
my %dups;
for (@array) {
    # second and later occurrences land in %dups
    $dups{$_}++ if $count{$_}++;
}
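For comparison, a grep version of the same hash-counting idea might look like this. It's my sketch, not code from the answer:

use strict;
use warnings;

my @array = qw/yes no maybe true false false perhaps no/;

my %count;
# grep keeps every occurrence after the first;
# map turns those occurrences into hash keys
my %dups = map { $_ => 1 } grep { $count{$_}++ } @array;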

A memory-efficient way is to sort the array in place and iterate through it looking for adjacent equal entries.

# not exactly sort in place, but Perl does a decent job optimizing it
@array = sort @array;

my $last;
my %dups;
for my $entry (@array) {
    $dups{$entry}++ if defined $last and $entry eq $last;
    $last = $entry;
}

This is O(n log n) because of the sort, but it only needs to store the duplicates rather than a second copy of the data in %count. Worst-case memory usage is still O(n) (when everything is duplicated), but if your array is large and there aren't a lot of duplicates, you'll win.

Theory aside, benchmarking shows the latter starts to lose on large arrays (over a million entries) with a high percentage of duplicates.
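If you want to reproduce that comparison, the core Benchmark module's cmpthese can time both approaches. This is a hedged sketch: the data set (a million values drawn from only a thousand possibilities, so lots of duplicates) and the iteration count are illustrative, not the benchmark the answer ran:

use strict;
use warnings;
use Benchmark qw(cmpthese);

# High duplicate percentage: a million values, only 1000 distinct
my @array = map { int rand 1000 } 1 .. 1_000_000;

cmpthese(5, {
    hash => sub {
        my (%count, %dups);
        for (@array) {
            $dups{$_}++ if $count{$_}++;
        }
    },
    sort => sub {
        my @sorted = sort @array;
        my ($last, %dups);
        for my $entry (@sorted) {
            $dups{$entry}++ if defined $last and $entry eq $last;
            $last = $entry;
        }
    },
});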

answered Oct 07 '22 by Schwern