I have a very large text file (500 GiB), and I want to count the number of unique words in it. I tried the following, but it seems to be very slow because of the sort step:
grep -o -E '\w+' temp | sort -u -f | wc -l
Is there any better way of doing this?
You can rely on awk's default behavior to split lines into words by runs of whitespace, and use its associative arrays:
awk '{ for (i=1; i<=NF; ++i) a[tolower($i)]++ } END { print length(a) }' file
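For example, on a small throwaway file (sample.txt is just an illustrative name), 'The' and 'the' collapse into a single word:
$ printf 'The quick brown fox\nthe quick dog\n' > sample.txt
$ awk '{ for (i=1; i<=NF; ++i) a[tolower($i)]++ } END { print length(a) }' sample.txt
5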
Update: As @rici points out in a comment, whitespace-separated tokens may include punctuation and characters other than _, and are thus not necessarily the same as grep's \w+ construct. @4ae1e1 therefore suggests using a field separator along the lines of '[^[:alnum:]_]'. Note that this results in each component of a hyphenated word being counted separately; similarly, an apostrophe splits words (so don't becomes don and t).
awk -F '[^[:alnum:]_]+' '{ for (i=1; i<=NF; ++i) { a[tolower($i)]++ } }
END { print length(a) - ("" in a) }' file
Array a counts the occurrences of each distinct word encountered in the input, converted to lowercase first so as to ignore differences in case. If you do NOT want to ignore case differences, simply remove the tolower() call.
Note that tolower() won't work properly with non-ASCII characters.
length(a) then equals the number of unique words; the ("" in a) term subtracts the empty-string key that shows up when a record starts with a separator character.
A strictly POSIX-compliant alternative to print length(a) is: for (k in a) ++count; print count. The above will work with GNU Awk, Mawk (1.3.4+), and BSD Awk, even though it isn't strictly POSIX-compliant (POSIX defines the length function only for strings, not arrays).
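Putting that POSIX note into practice, a sketch of a fully portable variant (illustrative only, using just POSIX-defined features, and skipping the empty-string key explicitly):
awk -F '[^[:alnum:]_]+' '{ for (i=1; i<=NF; ++i) a[tolower($i)]++ }
END { count = 0; for (k in a) if (k != "") ++count; print count }' file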
awk to the rescue!
$ awk -v RS=" " '{a[$0]++} END{for(k in a) sum++; print sum}' file
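A quick illustration on made-up input (printf is used here so the sample ends without a trailing newline, since RS is a single space character):
$ printf 'the quick brown fox the fox' | awk -v RS=" " '{a[$0]++} END{for(k in a) sum++; print sum}'
4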
UPDATE:
It's probably better to do the preprocessing with tr and let awk do the counting economically. You may want to delimit the words with spaces or newlines.
For example:
$ tr ':;,?!\"' ' ' < file | tr -s ' ' '\n' | awk '!a[$0]++{c++} END{print c}'
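As a quick check on tiny made-up input (note there is no tolower() in this pipeline, so Hello and hello count as two distinct words):
$ printf 'Hello, world! hello world?\n' | tr ':;,?!\"' ' ' | tr -s ' ' '\n' | awk '!a[$0]++{c++} END{print c}'
3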