
number of unique words in a document

Tags: text, grep, bash

I have a very large text file (500 GiB) and I want to count its unique words. I tried the following, but it is very slow because it has to sort the entire input:

grep -o -E '\w+' temp | sort -u -f | wc -l

Is there any better way of doing this?

user3639557 asked Nov 01 '25 15:11


2 Answers

You can rely on awk's default behavior to split lines into words by runs of whitespace, and use its associative arrays:

awk '{ for (i=1; i<=NF; ++i) a[tolower($i)]++ } END { print length(a) }' file

Update: As @rici points out in a comment, whitespace-separated tokens may include punctuation other than _ and other non-word characters, so they are not necessarily the same as what grep's \w+ construct matches. @4ae1e1 therefore suggests using a field separator along the lines of '[^[:alnum:]_]'. Note that this causes each component of a hyphenated word to be counted separately; similarly, an apostrophe (') separates words.

awk -F '[^[:alnum:]_]+' '{ for (i=1; i<=NF; ++i) { a[tolower($i)]++ } }
        END { print length(a) - ("" in a) }' file
  • Associative array a gets one key per distinct word encountered in the input; each word is converted to lowercase first so that differences in case are ignored. If you do NOT want to ignore case differences, simply remove the tolower() call.
    • CAVEAT: It seems that Mawk and BSD Awk aren't locale-aware, so tolower() won't work properly with non-ASCII characters.
  • Once all words have been processed, the number of elements of a equals the number of unique words. In the second command, - ("" in a) subtracts 1 if an empty field (produced by a separator at the start or end of a line) was stored as a key.
    • NOTE: The POSIX-compliant reformulation of print length(a) is: for (k in a) ++count; print count

The above will work with GNU Awk, Mawk (1.3.4+), and BSD Awk, even though it isn't strictly POSIX-compliant (POSIX defines the length function only for strings, not arrays).
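
For an awk whose length() does not accept an array, the POSIX-compliant loop from the note above can be folded into the command. The following is a minimal sketch, not a benchmarked replacement: the function name count_unique_words is made up for illustration, and it skips empty fields up front instead of subtracting ("" in a) at the end.

```shell
# Hypothetical helper: counts distinct case-folded words.
# Reads the files given as arguments, or stdin if there are none.
count_unique_words() {
  awk -F '[^[:alnum:]_]+' '
    # Record each non-empty field, lowercased, as an array key.
    { for (i = 1; i <= NF; ++i) if ($i != "") seen[tolower($i)] = 1 }
    # Count the keys with a loop, since POSIX length() is defined for strings only.
    END { n = 0; for (k in seen) ++n; print n }
  ' "$@"
}

# Small sample: the, cat, dog, a, nap
printf 'The cat, the dog.\nA cat-nap!\n' | count_unique_words   # prints 5
```

On the real input this would be called as count_unique_words file. Memory use grows with the number of distinct words, not with the size of the file.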

mklement0 answered Nov 04 '25 20:11


awk to the rescue!

$ awk -v RS=" " '{a[$0]++} END{for(k in a) sum++; print sum}' file

UPDATE:

It's probably better to do the preprocessing with tr and let awk do only the counting, which keeps the awk part cheap. You may want to delimit the words with spaces or newlines.

For example:

$ tr ':;,?!\"' ' ' < file | tr -s ' ' '\n' | awk '!a[$0]++{c++} END{print c}'
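
A variation on the same idea, as a sketch only: instead of listing punctuation characters one by one, tr -c can turn every character that is not alphanumeric or _ into a newline, which approximates grep's \w+ from the question. The function name count_unique_words_tr is made up for illustration, and the lowercasing step is an assumption carried over from the sort -f in the question.

```shell
# Hypothetical helper: reads text on stdin, prints the number of distinct words.
count_unique_words_tr() {
  # 1. Replace every run of non-word characters with a single newline.
  tr -cs '[:alnum:]_' '\n' |
    # 2. Fold case so that "The" and "the" count as one word.
    tr '[:upper:]' '[:lower:]' |
    # 3. Count each non-empty line the first time it is seen.
    awk 'NF && !seen[$0]++ { c++ } END { print c + 0 }'
}

printf 'The cat, the dog.\nA cat-nap!\n' | count_unique_words_tr   # prints 5
```

For the real input, run count_unique_words_tr < file. The NF guard drops the empty line that appears when the input starts with a non-word character, and c + 0 makes empty input print 0 instead of a blank line.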

karakfa answered Nov 04 '25 20:11


