number of unique words in a document

Question

I have a very large text file (500 GiB) and want to count the unique words in it. I tried the following, but it is very slow because it sorts the input:

grep -o -E '\w+' temp | sort -u -f | wc -l

Is there any better way of doing this?

mklement0 · Accepted Answer

You can rely on awk's default behavior to split lines into words by runs of whitespace, and use its associative arrays:

awk '{ for (i=1; i<=NF; ++i) a[tolower($i)]++ } END { print length(a) }' file
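To see what whitespace splitting actually counts, the sketch below lists the distinct tokens instead of their number. The three-line sample and the function name distinct_ws_tokens are made up for illustration, and the END block uses a for-in loop rather than length(a) only to keep the sketch portable.

```shell
# Hypothetical helper: prints each distinct whitespace-separated token
# (lowercased) read from stdin, one per line, sorted for stable output.
distinct_ws_tokens() {
  awk '{ for (i = 1; i <= NF; ++i) a[tolower($i)]++ }
       END { for (k in a) print k }' | LC_ALL=C sort
}

# Made-up sample input.
printf '%s\n' 'The cat sat.' 'the CAT sat' 'A cat, sat' | distinct_ws_tokens
```

On this sample, sat and sat. come out as separate entries, as do cat and cat, because punctuation stays attached to the token it touches.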

Update: As @rici points out in a comment, whitespace-separated tokens may include punctuation other than _ and other non-word characters, so they are not necessarily the same as what grep's \w+ construct matches. @4ae1e1 therefore suggests using a field separator along the lines of '[^[:alnum:]_]'. Note that this causes each component of a hyphenated word to be counted separately; similarly, ' separates words.

awk -F '[^[:alnum:]_]+' '{ for (i=1; i<=NF; ++i) { a[tolower($i)]++ } }
        END { print length(a) - ("" in a) }' file
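The subtraction of ("" in a) can be illustrated on a tiny input: when a line starts or ends with a separator, awk produces an empty field, which lands in the array under the key "". The sketch below prints the raw element count next to the corrected one. The function name count_words_fs and the sample lines are hypothetical, and the for-in loop stands in for length(a) purely for portability.

```shell
# Hypothetical helper: prints "<raw array size> <size minus empty key>".
count_words_fs() {
  awk -F '[^[:alnum:]_]+' '
    { for (i = 1; i <= NF; ++i) a[tolower($i)]++ }
    END {
      n = 0
      for (k in a) ++n            # portable stand-in for length(a)
      print n, n - ("" in a)      # ("" in a) is 1 if an empty field was seen
    }'
}

# The trailing "!" on the first line yields an empty last field.
printf '%s\n' 'Hello, world!' 'hello WORLD' | count_words_fs
```

For this sample the two numbers differ by one, because the empty field after the trailing ! is stored as an extra array element.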

Associative array a counts the occurrences of each distinct word in the input, converting each word to lowercase first so that differences in case are ignored. If you do NOT want to ignore case, simply remove the tolower() call.
- CAVEAT: It seems that Mawk and BSD Awk aren't locale-aware, so tolower() won't work properly with non-ASCII characters.
Once all words have been processed, the number of elements of a equals the number of unique words.
- NOTE: The POSIX-compliant reformulation of print length(a) is: for (k in a) ++count; print count

The above will work with GNU Awk, Mawk (1.3.4+), and BSD Awk, even though it isn't strictly POSIX-compliant (POSIX defines the length function only for strings, not arrays).
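For an awk where length() does not accept an array, the for-in reformulation from the note can be dropped into the full command. This is a minimal sketch: the function name count_unique_posix is hypothetical, and the + 0 is an addition so that empty input prints 0 rather than an empty line.

```shell
# Hypothetical helper: counts distinct whitespace-separated words,
# ignoring case, without calling length() on an array.
# Reads the files given as arguments, or stdin if there are none.
count_unique_posix() {
  awk '{ for (i = 1; i <= NF; ++i) a[tolower($i)]++ }
       END { for (k in a) ++count; print count + 0 }' "$@"
}

# Made-up sample: "a" and "A" fold to the same word.
printf '%s\n' 'a b A' 'c' | count_unique_posix
```

With a real file it would be called as count_unique_posix file.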

karakfa · Answer

awk to the rescue!

$ awk -v RS=" " '{a[$0]++} END{for(k in a) sum++; print sum}' file

UPDATE:

It's probably better to preprocess with tr and leave only the counting to awk, which keeps the awk step cheap. You may want to delimit the words with spaces or newlines.

For example:

$ tr ':;,?!\"' ' ' < file | tr -s ' ' '\n' | awk '!a[$0]++{c++} END{print c}'
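A variation on the same idea, not karakfa's exact command: instead of listing punctuation characters one by one, tr -c can turn every run of characters outside [:alnum:] and _ into a single newline, and a second tr can fold case before awk counts. The function name count_unique_tr is hypothetical, and the NF test and + 0 are additions to skip a possible empty first line and to print 0 on empty input.

```shell
# Hypothetical helper: one word per line via tr, then count distinct lines.
count_unique_tr() {
  tr -cs '[:alnum:]_' '\n' |          # non-word runs become one newline
    tr '[:upper:]' '[:lower:]' |      # fold case
    awk 'NF && !seen[$0]++ { c++ } END { print c + 0 }'
}

# Made-up sample input.
printf '%s\n' 'Hello, world!' 'hello WORLD; "quoted" word' | count_unique_tr
```

With a real file it would be called as count_unique_tr < file. As with the field-separator approach, hyphens and apostrophes split words, and how non-ASCII letters are treated depends on the tr implementation and locale.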

Tags: text, grep, bash

user3639557
