awk FS vs FPAT puzzle and counting words but not blank fields

Question

Suppose I have the file:

$ cat file
This, that;
this-that or this.

(Punctuation at the line end is not always there...)

Now I want to count words (with words being defined as one or more ascii case-insensitive letters.) In typical POSIX *nix you could do:

sed -nE 's/[^[:alpha:]]+/ /g; s/ $//p' file | tr ' ' "
"  | tr '[:upper:]' '[:lower:]' | sort | uniq -c
   1 or
   2 that
   3 this

With grep you can shorten that a bit to only match what you define as a word:

grep -oE '[[:alpha:]]+' file | tr '[:upper:]' '[:lower:]' | sort | uniq -c
# same output

With GNU awk, you can use FPAT to replicate matching only what you want (ignore sorting...):

gawk -v FPAT="[[:alpha:]]+" '
{for (i=1;i<=NF;i++) {seen[tolower($i)]++}}
END {for (e in seen) printf "%4s %s
", seen[e], e}' file
   3 this
   1 or
   2 that

Now trying to replicate in POSIX awk I tried:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s
", seen[e], e}' file
   2 
   3 this
   1 or
   2 that

Note the 2 with blank at top. This is from having blank fields from ; at the end of line 1 and . at the end of line 2. If you delete the punctuation at line's end, this issue goes away.

You can partially fix it (for all but the last line) by setting RS="" in the awk, but still get a blank field with the last (only) line.

I can also fix it this way:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) if ($i) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s
", seen[e], e}' file

Which seems a little less than straight forward.

Is there an idiomatic fix I am missing to make POSIX awk act similarly to GNU awk's FPAT solution here?

anubhava · Accepted Answer

This should work in POSIX/BSD or any version of awk:

awk -F '[^[:alpha:]]+' '
{for (i=1; i<=NF; ++i) ($i != "") && ++count[tolower($i)]}
END {for (e in count) printf "%4s %s
", count[e], e}' file

   1 or
   3 this
   2 that

By using -F '[^[:alpha:]]+' we are splitting fields on any non-alpha character.
($i != "") condition will make sure to count only non-empty fields in seen.

glenn jackman · Answer

With POSIX awk, I'd use match and the builtin RSTART and RLENGTH variables:

#!awk
{
    s = $0
    while (match(s, /[[:alpha:]]+/)) {
        word = substr(s, RSTART, RLENGTH)
        count[tolower(word)]++
        s = substr(s, RSTART+RLENGTH)
    }
}
END {
    for (word in count) print count[word], word
}

$ awk -f countwords.awk file
1 or
3 this
2 that

Works with the default BSD awk on my Mac.

RavinderSingh13 · Answer

With your shown samples, please try following awk code. Written and tested in GNU awk in case you are ok to do this with RS approach.

awk -v RS='[[:alpha:]]+' '
RT{
  val[tolower(RT)]++
}
END{
  for(word in val){
    print val[word], word
  }
}
' Input_file

Explanation: Simple explanation would be, using RS variable of awk to make record separator as [[:alpha:]] then in main program creating array val whose index is RT variable and keep counting its occurrences with respect to same index in array val. In END block of this program traversing through array and printing indexes with its respective values.

James Brown · Answer

Using RS instead:

$ gawk -v RS="[^[:alpha:]]+" '  # [^a-zA-Z] or something for some awks
$0 {                            # remove possible leading null string
    a[tolower($0)]++
}
END {
    for(i in a)
        print i,a[i]
}' file

Output:

this 3
or 1
that 2

Tested successfully on gawk and Mac awk (version 20200816) and on mawk and busybox awk using [^a-zA-Z]

Carlos Pascual · Answer

With GNU awk using patsplit() and a second array for counting, you can try this:

awk 'patsplit($0, a, /[[:alpha:]]+/) {for (i in a) b[ tolower(a[i]) ]++} END {for (j in b) print b[j], j}' file
3 this
1 or
2 that

awk FS vs FPAT puzzle and counting words but not blank fields

Tags:

bash

awk

dawg

5 Answers

anubhava

glenn jackman

RavinderSingh13

James Brown

Carlos Pascual

Recent Activity

Donate For Us

awk FS vs FPAT puzzle and counting words but not blank fields

Tags:

bash

awk

dawg

5 Answers

anubhava

glenn jackman

RavinderSingh13

James Brown

Carlos Pascual

Related questions

Recent Activity

Donate For Us