 

List only duplicate lines based on one column from a semi-colon delimited file?

Tags:

linux

I have a file with a bunch of lines. Each of these lines has 8 semicolon-delimited columns.

How can I (in Linux) print the duplicate lines, judging duplicates only by column 2? Should I be using grep or something else?

asked Sep 20 '09 by goe

2 Answers

See my comments in the awk script below.

$ cat data.txt 
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416

$ cat dup.awk 
BEGIN { FS = ";" }

{
    # Count occurrences of each value in the second column
    count[$2]++;

    # Save the line the first time we encounter a given value
    if (count[$2] == 1)
        first[$2] = $0;

    # If we encounter the value for the second time, print the
    # previously saved line
    if (count[$2] == 2)
        print first[$2];

    # From the second time onward, always print, because the value
    # is duplicated
    if (count[$2] > 1)
        print
}

Example output:

$ sort -t ';' -k 2 data.txt | awk -f dup.awk

John Thomas;jd;301
John Tomas;jd;302
Alex Tremble;atrem;415
Alex Trebe;atrem;416
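
A side note, not in the original answer: dup.awk doesn't strictly require sorted input, because it remembers the first line seen for each key and only starts printing at the second occurrence. On this sample data, running it directly happens to give the same output:

$ awk -f dup.awk data.txt
John Thomas;jd;301
John Tomas;jd;302
Alex Tremble;atrem;415
Alex Trebe;atrem;416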

Here is my solution #2:

awk -F';' '{print $2}' data.txt |sort|uniq -d|grep -F -f - data.txt

The beauty of this solution is that it preserves the line order, at the expense of stringing several tools together (awk, sort, uniq, and fgrep).

The awk command prints out the second field, and its output is then sorted. Next, uniq -d picks out the duplicated strings. At this point, standard output contains a list of duplicated second fields, one per line. We then pipe that list into fgrep; the -f - flag tells fgrep to read its search strings from standard input.
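
For instance, on the data.txt shown earlier, the first three stages produce the list of duplicated keys:

$ awk -F';' '{print $2}' data.txt | sort | uniq -d
atrem
jd

fgrep then prints every line of data.txt that contains one of those strings, in the file's original order.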

Yes, you can go all out with the command line. I like the second solution better for exercising many tools and for its clearer logic (at least to me). The drawbacks are the number of tools involved and possibly the memory used. Also, the second solution is inefficient because it scans the data file twice: once with the awk command and once with the fgrep command. This consideration matters only when the input file is large.
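
One caveat the answer doesn't mention: fgrep matches the duplicated strings anywhere in the line, not just in column 2, so a value that also shows up in some other column of an otherwise-unique line would be a false positive. As a rough alternative sketch (my addition, not part of the original answer), a two-pass awk counts the keys on the first read of the file and prints exact column-2 duplicates on the second, still preserving line order:

$ awk -F';' 'NR==FNR { count[$2]++; next } count[$2] > 1' data.txt data.txt
John Thomas;jd;301
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416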

answered Oct 09 '22 by Hai Vu


how about:

sort -t ';' -k 2 test.txt | awk -F';' '
    NF==8 {                      # only 8-column lines, per the question
        if ($2 == prev) {        # same key as the previous line
            if (!printed) print prevline  # first line of the group, once
            print                # the current duplicate
            printed = 1
        } else
            printed = 0
        prev = $2; prevline = $0
    }'

I also tried the uniq command, which has a -d option for displaying repeated lines, but I couldn't figure out whether it can be used with fields.
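
For what it's worth, uniq -f only skips whitespace-separated fields and -w only compares a fixed number of leading characters, so uniq can't key on a semicolon-delimited column directly. A possible workaround (my sketch, assuming GNU uniq for -D/-w and that no column-2 value exceeds 20 characters) is to pad column 2 to a fixed width at the front of each line, have uniq compare just that prefix, and strip it afterwards:

$ awk -F';' '{printf "%-20s%s\n", $2, $0}' data.txt | sort | uniq -D -w 20 | cut -c21-
Alex Trebe;atrem;416
Alex Tremble;atrem;415
John Thomas;jd;301
John Tomas;jd;302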

answered Oct 09 '22 by sud03r