I have a file with a bunch of lines. Each of these lines has 8 semicolon-delimited columns.
How can I (in Linux) return duplicate lines but only based on column number 2?
Should I be using grep or something else?
See my comments in the awk script below.
$ cat data.txt
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
$ cat dup.awk
BEGIN { FS = ";" }
{
    # Keep count of the fields in second column
    count[$2]++;
    # Save the line the first time we encounter a unique field
    if (count[$2] == 1)
        first[$2] = $0;
    # If we encounter the field for the second time, print the
    # previously saved line
    if (count[$2] == 2)
        print first[$2];
    # From the second time onward, always print because the field is
    # duplicated
    if (count[$2] > 1)
        print
}
Example output:
$ sort -t ';' -k 2 data.txt | awk -f dup.awk
John Thomas;jd;301
John Tomas;jd;302
Alex Tremble;atrem;415
Alex Trebe;atrem;416
Here is my solution #2:
awk -F';' '{print $2}' data.txt |sort|uniq -d|grep -F -f - data.txt
The beauty of this solution is that it preserves the line order, at the expense of using many tools together (awk, sort, uniq, and grep -F, which is equivalent to fgrep).
The awk command prints out the second field of every line, and that output is then sorted. Next, the uniq -d command picks out the duplicated strings. At this point, the standard output contains a list of duplicated second fields, one per line. We then pipe that list into grep -F (fgrep). The '-f -' flag tells grep to read the strings to search for from standard input.
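To make the intermediate step concrete, here is a small sketch that runs only the first three stages of the pipeline on the sample data from the first answer. The function name dup_keys is made up for illustration; it is not part of the original answer.

```shell
# Recreate the sample data from the first answer so this runs standalone.
cat > data.txt <<'EOF'
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
EOF

# Hypothetical helper: extract column 2, sort it, and keep only the
# values that occur more than once (uniq -d needs sorted input).
dup_keys() {
    awk -F';' '{print $2}' "$1" | sort | uniq -d
}

dup_keys data.txt
# prints:
# atrem
# jd
```

This list of keys is what the final grep -F -f - step receives on standard input.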
Yes, you can do it all on the command line. I like the second solution better because it exercises many tools and its logic is clearer (at least to me). The drawback is the number of tools and possibly the memory used. Also, the second solution is inefficient because it scans the data file twice: the first time with the awk command and the second time with the grep command. This consideration matters only when the input file is large.
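One caveat with the grep -F step: it matches a duplicated key anywhere in the line, so a key such as jd would also select a line that merely contains "jd" in some other column. As a sketch of an alternative (not the method from the answer above), awk can read the file twice on its own. It still scans the file twice, but it uses a single tool, compares column 2 exactly, and keeps the original line order. The function name dups_by_col2 is made up for illustration.

```shell
# Recreate the sample data from the first answer so this runs standalone.
cat > data.txt <<'EOF'
John Thomas;jd;301
Julie Andrews;jand;109
Alex Tremble;atrem;415
John Tomas;jd;302
Alex Trebe;atrem;416
EOF

# Hypothetical helper: the file is passed to awk twice.
# Pass 1 (NR==FNR is true only for the first file): count column-2 values.
# Pass 2: print each line whose column-2 value was counted more than once.
dups_by_col2() {
    awk -F';' 'NR==FNR { count[$2]++; next } count[$2] > 1' "$1" "$1"
}

dups_by_col2 data.txt
# prints:
# John Thomas;jd;301
# Alex Tremble;atrem;415
# John Tomas;jd;302
# Alex Trebe;atrem;416
```

Because the file name is given twice, this only works on a regular file, not on piped input.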
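How about this? It only considers lines with exactly 8 fields and prints one line (the second occurrence) for each duplicated value in column 2:
sort -t ';' -k 2 test.txt | awk -F';' 'BEGIN{curr="";prev="";flag=0} \
    NF==8{ prev=curr;
           curr=$2;
           if(prev!=curr){flag=1}
           if(flag!=0 && prev==curr)flag++ ;
           if(flag==2)print $0}'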
I also tried the uniq command, which has a "-d" option for displaying repeated lines, but I was unable to figure out whether it can be used with fields.