
awk + How do I find duplicates in a column?

Tags:

awk

How do I find duplicates in a column?

$ head countries_lat_long_int_code3.csv | cat -n
     1  country,latitude,longitude,name,code
     2  AD,42.546245,1.601554,Andorra,376
     3  AE,23.424076,53.847818,United Arab Emirates,971
     4  AF,33.93911,67.709953,Afghanistan,93
     5  AG,17.060816,-61.796428,Antigua and Barbuda,1
     6  AI,18.220554,-63.068615,Anguilla,1
     7  AL,41.153332,20.168331,Albania,355
     8  AM,40.069099,45.038189,Armenia,374
     9  AN,12.226079,-69.060087,Netherlands Antilles,599
    10  AO,-11.202692,17.873887,Angola,244

For instance, these two rows have duplicate values in the 5th column:

     5  AG,17.060816,-61.796428,Antigua and Barbuda,1
     6  AI,18.220554,-63.068615,Anguilla,1

How do I view all the other duplicated rows in this file?

I know I can do this:

awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort

And I can eyeball it to see if there are any duplicates, but is there a better way?
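
For example, I could pipe the same list through uniq -d to show only the repeated codes instead of eyeballing the sorted output, but that still gives me just the code, not the full rows:

$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | uniq -d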

Or I can do this: find out how many values there are in total

$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | wc -l
210

Then find out how many unique values there are

$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | uniq | wc -l
183

Therefore there are at most 27 (210-183) duplicates.
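
To see which codes are duplicated and how often each occurs, I could also count with uniq -c and keep only the counts above 1, something like:

$ awk -F, 'NR>1{print $5}' countries_lat_long_int_code3.csv | sort | uniq -c | awk '$1>1'

but again this loses the other columns of the duplicated rows.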

EDIT1

My desired output would be something like the following: all the columns, but only the rows whose 5th column value is duplicated:

 5  AG,17.060816,-61.796428,Antigua and Barbuda,1
 6  AI,18.220554,-63.068615,Anguilla,1

asked Aug 19 '15 by HattrickNZ



1 Answer

This will give you the duplicated codes:

awk -F, 'a[$5]++{print $5}'
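
If you want each duplicated code reported only once even when it occurs more than twice, a small variant of the same idea should work (it prints the code only the second time it is seen):

awk -F, 'a[$5]++==1{print $5}' countries_lat_long_int_code3.csv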

If you're only interested in the count of duplicate codes:

awk -F, 'a[$5]++{count++} END{print count}'
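
If the file might contain no duplicates at all, count would be unset and the END block would print an empty line; printing count+0 forces a 0 instead:

awk -F, 'a[$5]++{count++} END{print count+0}' countries_lat_long_int_code3.csv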

To print the duplicated rows, try this:

awk -F, '$5 in a{print a[$5]; print} {a[$5]=$0}'
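
Note that if a code occurs more than twice, the line above prints some rows twice (each duplicate is printed when it is found and again later as the stored previous row). If you want every duplicated row printed exactly once, a two-pass sketch that reads the file twice would be:

awk -F, 'NR==FNR{c[$5]++; next} c[$5]>1' countries_lat_long_int_code3.csv countries_lat_long_int_code3.csv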

This will print the whole row for duplicates found in column $5:

awk -F, 'a[$5]++{print $0}'
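
As a quick check against just the ten lines shown in the question, this prints only the second and later occurrences, so the AG row (the first row with code 1) is skipped:

$ head countries_lat_long_int_code3.csv | awk -F, 'a[$5]++{print $0}'
AI,18.220554,-63.068615,Anguilla,1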

answered Oct 10 '22 by karakfa