I want to calculate the frequency of the words in a file that has one word per line. The file is really big (300k lines in this example), so that might be the problem.
I run this command:
cat .temp_occ | uniq -c | sort -k1,1nr -k2 > distribution.txt
and the problem is that the output has a bug: it treats identical words as different.
For example, the first entries are:
306 continua
278 apertura
211 eventi
189 murah
182 giochi
167 giochi
with giochi repeated twice, as you can see.
At the bottom of the file it becomes even worse and it looks like this:
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 win
1 winchester
1 wind
1 wind
for all the words.
What am I doing wrong?
Try to sort first:
cat .temp_occ | sort | uniq -c | sort -k1,1nr -k2 > distribution.txt
Or use "sort -u", which also eliminates duplicates (but it does not print counts, so it won't give you the frequencies).
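To make the difference concrete, here is a minimal sketch on a made-up three-line input (the words and variable names are only for illustration): "sort -u" yields the distinct words, while "sort | uniq -c" yields the distinct words together with how often each occurs.

```shell
# Hypothetical sample input: one word per line, duplicates not adjacent.
words=$(printf 'giochi\nwin\ngiochi\n')

# sort -u: distinct words only, no counts.
distinct=$(printf '%s\n' "$words" | sort -u)
echo "$distinct"

# sort | uniq -c: distinct words, each prefixed with its count.
counted=$(printf '%s\n' "$words" | sort | uniq -c)
echo "$counted"
```

For a frequency table like distribution.txt, the second form is the one that applies.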
The size of the file has nothing to do with what you're seeing. From the man page of uniq(1):
Note: 'uniq' does not detect repeated lines unless they are adjacent. You may want to sort the input first, or use 'sort -u' without 'uniq'. Also, comparisons honor the rules specified by 'LC_COLLATE'.
So running uniq on
a
b
a
will return:
a
b
a
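The same a/b/a input can be used to check this with counts. The following is a small sketch, not the only way to do it; the LC_ALL=C prefix is an optional assumption on my part that forces plain byte-wise ordering so the locale's collation rules don't affect the sort.

```shell
# Without sorting: the two "a" lines are not adjacent,
# so uniq -c reports three separate groups.
unsorted=$(printf 'a\nb\na\n' | uniq -c)
echo "$unsorted"

# With sorting: the "a" lines become adjacent and are counted together.
# LC_ALL=C is optional; it makes sort compare lines byte by byte.
sorted=$(printf 'a\nb\na\n' | LC_ALL=C sort | uniq -c)
echo "$sorted"
```

The first command prints a count of 1 for each of the three lines; the second prints a count of 2 for a and 1 for b.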