Using linux command "sort -f | uniq -i" together for ignoring case

Question

I am trying to find unique and duplicate data in a list of data with two columns. I really just want to compare the data in column 1.

The data might look like this (separated by a tab):

What are you doing?     Che cosa stai facendo?
WHAT ARE YOU DOING?     Che diavolo stai facendo?
what are you doing?     Qual è il tuo problema amico?

So I have been playing around with the following:

1. Sorting without ignoring case (just "sort", no -f option) gives me fewer duplicates:

gawk 'BEGIN { FS = "\t" } { print $1 }' EN-IT_Corpus.txt | sort | uniq -i -D > dupes

2. Sorting with ignoring case ("sort -f") gives me more duplicates:

gawk 'BEGIN { FS = "\t" } { print $1 }' EN-IT_Corpus.txt | sort -f | uniq -i -D > dupes

Am I right to think that #2 is more accurate if I want to find duplicates ignoring case, because it sorts it ignoring case first and then finds duplicates based on the sorted data?

As far as I know, I can't fold the uniq step into sort, because sort has no option for displaying duplicates.

Thanks, Steve

stefansson · Accepted Answer

You might keep it simple:

sort -uf
# where sort -u = print only one line from each set of equal lines
#       sort -f = ignore case when comparing
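
One caveat worth sketching: plain sort -uf compares whole lines, so two rows with the same English text but different translations still count as different. A minimal sketch, assuming tab-separated input and that only column 1 should decide uniqueness; the function name unique_by_first_column is made up for illustration. Note that this keeps one row per key and does not list the duplicates themselves.

```shell
# Hypothetical helper: keep one row per distinct column-1 value, ignoring case.
unique_by_first_column() {
    # A literal tab, produced portably instead of typed into the script.
    tab=$(printf '\t')
    # -t sets the field separator, -k1,1 limits the sort key to column 1,
    # -f folds case, -u keeps one row from each run of equal keys.
    sort -t "$tab" -k1,1 -f -u "$1"
}
```

Usage would be something like: unique_by_first_column EN-IT_Corpus.txt > unique.txt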

Jonathan Leffler · Answer

I think the key is to preprocess the data:

file="EN-IT_Corpus.txt"
dups="dupes.$$"
sed 's/        .*//' $file | sort -f | uniq -i -D > $dups
fgrep -i -f $dups $file

The sed command generates just the English phrases (column 1); these are sorted case-insensitively and then run through uniq case-insensitively, printing only the duplicated entries. Then process the data file again, looking for those duplicated keys with fgrep or grep -F, and passing the list of patterns with -f $dups. Obviously (I hope) the big white space in the sed command is a tab; you may be able to write \t instead, depending on your shell and sed and so on.
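
On the original question of why sort -f finds more duplicates: uniq only compares adjacent lines, so the sort order decides which lines it ever sees side by side. A small demonstration, assuming GNU uniq (for -i and -D) and pinning the C locale so the ordering is reproducible; the sample words are invented for the demo.

```shell
# Four made-up sample lines: two pairs that differ only by case.
words=$(printf 'what\nWhen\nWHAT\nwhen\n')

# Plain sort in the C locale orders by byte value, uppercase first:
# WHAT, When, what, when. No two neighbours match, even ignoring case.
plain=$(printf '%s\n' "$words" | LC_ALL=C sort | uniq -i -D)

# sort -f folds case before comparing, so WHAT/what and When/when
# become neighbours and uniq -i -D reports all four lines.
folded=$(printf '%s\n' "$words" | LC_ALL=C sort -f | uniq -i -D)

echo "after plain sort:"
printf '%s\n' "$plain"
echo "after sort -f:"
printf '%s\n' "$folded"
```

So pairing sort -f with uniq -i is the consistent combination; with a plain sort, how many duplicates turn up depends on the locale's collation order.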

In fact, with GNU grep, you can do:

sed 's/        .*//' $file |
sort -f |
uniq -i -D |
fgrep -i -f - $file

And if the number of duplicates is really big, you can squeeze them down with:

sed 's/        .*//' $file |
sort -f |
uniq -i -D |
sort -f -u |
fgrep -i -f - $file

Given the input data:

What a surprise?        Vous etes surpris?
What are you doing?        Che cosa stai facendo?
WHAT ARE YOU DOING?        Che diavolo stai facendo?
Provacation         Provacatore
what are you doing?        Qual è il tuo problema amico?
Ambiguous        Ambiguere

the output from all of these is:

What are you doing?        Che cosa stai facendo?
WHAT ARE YOU DOING?        Che diavolo stai facendo?
what are you doing?        Qual è il tuo problema amico?

jim mcnamara · Answer

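Or this:

Unique:

# -F'\t' makes $1 the whole first column rather than just the first word
awk -F'\t' '!arr[tolower($1)]++' inputfile > unique.txt

Duplicates:

# count each lowercased key, then report the keys seen more than once
awk -F'\t' '{ arr[tolower($1)]++ }
END { for (i in arr) { if (arr[i] > 1) { print i, "count:", arr[i] } } }' inputfile > dup.txt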
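
If you want the full duplicated rows rather than just the keys and counts, one option is to read the file twice with awk. A sketch only, assuming tab-separated input; the function name dupe_rows and the array name count are made up here.

```shell
# Hypothetical helper: print every row whose column 1 occurs more than once,
# ignoring case. The same file is passed twice: once to count, once to print.
dupe_rows() {
    awk -F '\t' '
        # First pass (NR == FNR only while reading the first file): count keys.
        NR == FNR { count[tolower($1)]++; next }
        # Second pass: print rows whose key was seen more than once.
        count[tolower($1)] > 1
    ' "$1" "$1"
}
```

Because it compares column 1 exactly (after lowercasing), it does not pick up partial matches, and the rows come out in their original order.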

Tags: linux, sorting, awk, uniq, gawk

Steve3p0
