Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perl - Find duplicate lines in file or array

I'm trying to print duplicate lines from the filehandle, not remove them or anything else I see asked on other questions. I don't have enough experience with perl to be able to quickly do this, so I'm asking here. What's the way to do this?

like image 313
Chris Avatar asked May 04 '11 13:05

Chris


2 Answers

Using the standard Perl shorthands:

my %seen;
while ( <> ) { 
    print if $seen{$_}++;
}

As a "one-liner":

perl -ne 'print if $seen{$_}++'

More data? This prints <file name>:<line number>:<line>:

perl -ne 'print ( $ARGV eq "-" ? "" : "$ARGV:" ), "$.:$_" if $seen{$_}++'

Explanation of %seen:

  • %seen declares a hash. For each unique line in the input (which is coming from while(<>) in this case) $seen{$_} will have a scalar slot in the hash named by the the text of the line (this is what $_ is doing in the has {} braces).
  • Using the postfix increment operator (x++) we take the value for our expression, remembering to increment it after the expression. So, if we haven't "seen" the line $seen{$_} is undefined--but when forced into an numeric "context" like this, it's taken as 0--and false.
  • Then it's incremented to 1.

So, when the while begins to run, all lines are "zero" (if it helps you can think of the lines as "not %seen") then, the first time we see a line, perl takes the undefined value - which fails the if - and increments the count at the scalar slot to 1. Thus, it is 1 for any future occurrences at which point it passes the if condition and it printed.

Now as I said above, %seen declares a hash, but with strict turned off, any variable expression can be created on the spot. So the first time perl sees $seen{$_} it knows that I'm looking for %seen, it doesn't have it, so it creates it.

An added neat thing about this is that at the end, if you care to use it, you have a count of how many times each line was repeated.

like image 104
Axeman Avatar answered Sep 29 '22 03:09

Axeman


try this

#!/usr/bin/perl -w
use strict;
use warnings;

my %duplicates;
while (<DATA>) {
    print if !defined $duplicates{$_};
    $duplicates{$_}++;
}
like image 23
mcgrailm Avatar answered Sep 29 '22 04:09

mcgrailm