I've found various implementations of ngrams in Python, Perl, etc., but I'd really like something in a bash script. I ran across the "Missing textutils" version, but that only lists the ngrams; it doesn't count them by frequency, which is fairly central to using ngrams, or at least to my usage. I just want a basic list of results with their frequency, like this:
17 blue car
14 red car
5 and the
2 brown monkey
1 orange car
Anybody have something like that lying around that they could post? Thanks!
Yes, ngrams can be implemented in bash.
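# Usage: ngrams N < FILE
ngrams () {
    local N=$1
    local -a words
    set --    # clear the positional parameters; they hold the sliding word window
    # read -r keeps backslashes literal; -a splits the line into words without globbing.
    # The || test still processes a final line that lacks a trailing newline.
    while read -r -a words || (( ${#words[@]} )); do
        set -- "$@" "${words[@]}"
        # Emit an ngram for as long as at least N words are buffered.
        while (( $# >= N )); do
            printf '%s\n' "${*:1:N}"
            shift
        done
    done |
    sort | uniq -c
}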
$ ngrams 2
Here is some text, and here is
some more text, and here is yet
some more text
1 Here is
2 and here
2 here is
2 is some
1 is yet
1 more text
1 more text,
2 some more
1 some text,
2 text, and
1 yet some
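In the run above, "Here is" and "here is" are counted separately, and so are "more text" and "more text,". If that isn't what you want, one option is to normalize the text before it reaches ngrams. Here is a minimal sketch, assuming that lowercasing everything and deleting all punctuation is acceptable for your data. The name normalize is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper (not part of the answer above): lowercase the text
# and delete punctuation so that "Here is" and "here is" count as one ngram.
# Caveat: tr -d '[:punct:]' also removes apostrophes and hyphens.
normalize () {
    tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]'
}

# Prints: here is some text and here is
printf '%s\n' 'Here is some text, and HERE is' | normalize

# Intended use, assuming the ngrams function above is already defined:
#   normalize < FILE | ngrams 2
```

Whether to strip punctuation depends on the corpus; for text full of contractions you may prefer to delete only a chosen set of characters, such as commas and periods.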
Note: the above is a shell function, not a script, so it has to be defined in your current shell (for example, by sourcing the file that contains it) before you can call it. Here's the script version:
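#!/bin/bash
# Usage: ngrams N < FILE
N=$1
set --    # clear the positional parameters; they hold the sliding word window
# read -r keeps backslashes literal; -a splits the line into words without globbing.
# The || test still processes a final line that lacks a trailing newline.
while read -r -a words || (( ${#words[@]} )); do
    set -- "$@" "${words[@]}"
    # Emit an ngram for as long as at least N words are buffered.
    while (( $# >= N )); do
        printf '%s\n' "${*:1:N}"
        shift
    done
done |
sort | uniq -c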
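Both versions print the counts in ngram order, because uniq -c only merges adjacent lines and so the input has to be sorted by text first. The list in the question is ordered by frequency instead. A minimal sketch of one way to get that is a second, numeric sort on the count column. The name top_ngrams and its optional limit argument are made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: reorder "COUNT NGRAM" lines (the format uniq -c prints)
# so the largest counts come first, and keep the first LIMIT lines (default 10).
top_ngrams () {
    sort -rn | head -n "${1:-10}"
}

# Demo on hand-written counts taken from the question's sample list:
printf '%s\n' '   1 orange car' '  17 blue car' '   5 and the' | top_ngrams 2

# Intended use, assuming the script above is saved as an executable file ./ngrams:
#   ./ngrams 2 < FILE | top_ngrams 20
```

The demo prints the "17 blue car" line followed by the "5 and the" line. Keeping the ranking as a separate filter leaves the counting code untouched, so you can drop it when you want the plain ngram-ordered list.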