Can ngrams be generated in bash?

Tags:

bash

I've found various implementations of ngrams in Python, Perl, etc., but I'd really like something in a bash script. I ran across the "Missing textutils" version, but that only lists the ngrams; it doesn't count them by frequency, which is fairly central to using ngrams, or at least to my usage. I just want a basic list of results with their frequency, like this:

17 blue car
14 red car
5  and the
2  brown monkey
1  orange car

Anybody have something like that lying around that they could post? Thanks!

user1889034 asked Aug 31 '25 10:08



1 Answer

Yes, ngrams can be implemented in bash.

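# Usage: ngrams N < FILE
ngrams () {
  local N=$1
  local -a words
  (( N >= 1 )) || { echo "usage: ngrams N < FILE" >&2; return 1; }
  set --                         # start with an empty word window
  while read -r -a words; do     # split each line into words, no globbing
    set -- "$@" "${words[@]}"    # append this line's words to the window
    while (( $# >= N )); do      # emit an n-gram while N words are available
      echo "${*:1:N}"
      shift                      # slide the window forward by one word
    done
  done |
  sort | uniq -c
}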

$ ngrams 2
Here is some text, and here is
some more text, and here is yet
some more text
  1 Here is
  2 and here
  2 here is
  2 is some
  1 is yet
  1 more text
  1 more text,
  2 some more
  1 some text,
  2 text, and
  1 yet some
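
The listing above is ordered by n-gram text, because sort runs before uniq -c, while the question asks for results ranked by frequency. A minimal sketch of one way to get that ordering, assuming a hypothetical function name ngrams_by_freq (not part of the original answer), is to sort the counted lines a second time, numerically and in descending order:

```shell
#!/bin/bash
# Sketch only: ngrams_by_freq is a hypothetical name, not from the answer.
# Usage: ngrams_by_freq N < FILE
# Prints each N-gram read from stdin with its count, most frequent first.
ngrams_by_freq () {
  local N=$1
  local -a words
  (( N >= 1 )) || { echo "usage: ngrams_by_freq N < FILE" >&2; return 1; }
  set --
  while read -r -a words; do
    set -- "$@" "${words[@]}"
    while (( $# >= N )); do
      echo "${*:1:N}"
      shift
    done
  done |
  LC_ALL=C sort | uniq -c |
  LC_ALL=C sort -k1,1nr -k2    # count descending, then n-gram text ascending
}
```

Piping the result through head, as in ngrams_by_freq 2 < FILE | head -n 20, would then show only the 20 most frequent bigrams. LC_ALL=C is set here only to make the tie-breaking order the same on every system.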

Note: ngrams as defined above is a shell function, not a script, so it has to be defined in the current shell (for example, by sourcing a file that contains it) before it can be called. Here's the script version:

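#!/bin/bash
# Usage: ngrams N < FILE
N=$1
(( N >= 1 )) || { echo "usage: ${0##*/} N < FILE" >&2; exit 1; }
set --                         # start with an empty word window
while read -r -a words; do     # split each line into words, no globbing
  set -- "$@" "${words[@]}"    # append this line's words to the window
  while (( $# >= N )); do      # emit an n-gram while N words are available
    echo "${*:1:N}"
    shift                      # slide the window forward by one word
  done
done |
sort | uniq -c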
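
In the sample run, "more text" and "more text," are counted separately, and so are "Here is" and "here is", because punctuation and case are treated as part of each word. If that is not wanted, one option is to normalize the text before it reaches the n-gram code. A minimal sketch, assuming a hypothetical filter named normalize_words (not part of the original answer) that lowercases letters and deletes punctuation:

```shell
#!/bin/bash
# Sketch only: normalize_words is a hypothetical helper, not from the answer.
# Reads text on stdin, lowercases it and deletes punctuation characters,
# so that "Text," and "text" end up as the same word.
normalize_words () {
  tr '[:upper:]' '[:lower:]' | tr -d '[:punct:]'
}

# Possible use, assuming the script is saved under the hypothetical
# name ./ngrams.sh:
#   normalize_words < FILE | ./ngrams.sh 2
```

Deleting punctuation is a blunt choice: it also joins hyphenated words and drops apostrophes, so whether it suits a given text is a judgment call.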

rici answered Sep 03 '25 04:09
