I am trying to find unique and duplicate data in a list of data with two columns. I really just want to compare the data in column 1.
The data might look like this (separated by a tab):
What are you doing? Che cosa stai facendo?
WHAT ARE YOU DOING? Che diavolo stai facendo?
what are you doing? Qual è il tuo problema amico?
So I have been playing around with the following:
Sorting without ignoring case (just "sort", no -f option) gives me fewer duplicates:
gawk -F'\t' '{ print $1 }' EN-IT_Corpus.txt | sort | uniq -i -D > dupes
Sorting while ignoring case ("sort -f") gives me more duplicates:
gawk -F'\t' '{ print $1 }' EN-IT_Corpus.txt | sort -f | uniq -i -D > dupes
Am I right to think that #2 is more accurate if I want to find duplicates ignoring case, because it sorts it ignoring case first and then finds duplicates based on the sorted data?
As far as I know I can't combine the sort and unique commands because sort doesn't have an option for displaying duplicates.
Thanks, Steve
You might keep it simple:
sort -uf
# where sort -u = output only one line per key (unique lines)
#       sort -f = ignore case when comparing
I think the key is to preprocess the data:
file="EN-IT_Corpus.txt"
dups="dupes.$$"
sed 's/\t.*//' "$file" | sort -f | uniq -i -D > "$dups"
fgrep -i -f "$dups" "$file"
The sed command strips everything from the first tab onward, leaving just the English phrases; these are sorted case-insensitively and then run through uniq case-insensitively, printing only the duplicated entries. The data file is then processed again, looking for those duplicated keys with fgrep (or grep -F), with -f "$dups" naming the file of patterns to look for. The \t in the sed command stands for a tab; GNU sed understands it, but depending on your shell and sed you may need to type a literal tab character instead.
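The reason for sorting case-insensitively first is that uniq only compares adjacent lines. A minimal sketch of the difference, using made-up keys (the "Zebra" line is not from the corpus) and assuming GNU sort and uniq:

```shell
# Made-up keys: three case variants of one phrase plus an unrelated key.
keys='What are you doing?
WHAT ARE YOU DOING?
Zebra
what are you doing?'

# A byte-order sort puts "Zebra" between the capitalised and lowercase
# variants, so uniq -i cannot see all three as adjacent duplicates.
plain=$(printf '%s\n' "$keys" | LC_ALL=C sort | uniq -i -D | wc -l)

# A case-folded sort groups all three variants together first.
folded=$(printf '%s\n' "$keys" | LC_ALL=C sort -f | uniq -i -D | wc -l)

echo "plain sort: $plain duplicate lines; sort -f: $folded duplicate lines"
```

With the plain sort, the lowercase variant is separated from the other two and gets missed; with sort -f all three are reported.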
In fact, with GNU grep, you can do:
sed 's/\t.*//' "$file" |
sort -f |
uniq -i -D |
fgrep -i -f - "$file"
And if the number of duplicates is really big, you can squeeze them down with:
sed 's/\t.*//' "$file" |
sort -f |
uniq -i -D |
sort -f -u |
fgrep -i -f - "$file"
Given the input data:
What a surprise? Vous etes surpris?
What are you doing? Che cosa stai facendo?
WHAT ARE YOU DOING? Che diavolo stai facendo?
Provacation Provacatore
what are you doing? Qual è il tuo problema amico?
Ambiguous Ambiguere
the output from all of these is:
What are you doing? Che cosa stai facendo?
WHAT ARE YOU DOING? Che diavolo stai facendo?
what are you doing? Qual è il tuo problema amico?
Or this:
Unique (the first line seen for each key, ignoring case):
awk -F'\t' '!arr[tolower($1)]++' inputfile > unique.txt
Duplicates (each repeated key with its count):
awk -F'\t' '{ arr[tolower($1)]++ }
END { for (i in arr) if (arr[i] > 1) print i, "count:", arr[i] }' inputfile > dup.txt
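If you want the full lines rather than just the keys and counts, one possible variation is to read the file twice with awk: count the lowercased keys on the first pass, then print on the second. This is only a sketch; the temporary file and the variable names are made up for illustration, and the data is a subset of the sample lines above, tab-separated.

```shell
# Hypothetical sample file: tab-separated, English key in column 1.
sample=$(mktemp)
printf '%s\t%s\n' \
  'What a surprise?'    'Vous etes surpris?' \
  'What are you doing?' 'Che cosa stai facendo?' \
  'WHAT ARE YOU DOING?' 'Che diavolo stai facendo?' \
  'what are you doing?' 'Qual è il tuo problema amico?' \
  'Ambiguous'           'Ambiguere' > "$sample"

# Pass 1 (NR == FNR): count each lowercased key.
# Pass 2: print full lines whose key occurred more than once.
dupes=$(awk -F'\t' 'NR == FNR { seen[tolower($1)]++; next }
                    seen[tolower($1)] > 1' "$sample" "$sample")

# Same idea, but keep only lines whose key occurred exactly once.
singles=$(awk -F'\t' 'NR == FNR { seen[tolower($1)]++; next }
                      seen[tolower($1)] == 1' "$sample" "$sample")

printf 'Duplicates:\n%s\n\nSingles:\n%s\n' "$dupes" "$singles"
rm -f "$sample"
```

Note that "singles" here means keys that occur exactly once, which is different from the unique.txt command above, which keeps the first line of every key.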