
awk FS vs FPAT puzzle and counting words but not blank fields

Tags: bash, awk

Suppose I have the file:

$ cat file
This, that;
this-that or this.

(Punctuation at the line end is not always there...)

Now I want to count words case-insensitively, with a word defined as one or more ASCII letters. On a typical POSIX *nix you could do:

sed -nE 's/[^[:alpha:]]+/ /g; s/ $//p' file | tr ' ' "\n"  | tr '[:upper:]' '[:lower:]' | sort | uniq -c
   1 or
   2 that
   3 this

With grep you can shorten that a bit to only match what you define as a word:

grep -oE '[[:alpha:]]+' file | tr '[:upper:]' '[:lower:]' | sort | uniq -c
# same output

With GNU awk, you can use FPAT to match only what you want (ignore the output order; awk's for-in traversal order is unspecified):

gawk -v FPAT="[[:alpha:]]+" '
{for (i=1;i<=NF;i++) {seen[tolower($i)]++}}
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
   3 this
   1 or
   2 that

Trying to replicate this in POSIX awk, I tried:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file
   2 
   3 this
   1 or
   2 that

Note the 2 with a blank word at the top. It comes from the empty trailing fields produced by the ; at the end of line 1 and the . at the end of line 2. If you delete the punctuation at the end of each line, the issue goes away.
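
To see where the blank entry comes from, here is a minimal sketch that prints NF and every field for the first sample line. The helper name show_fields is made up for illustration; the FS regex is the one from the attempt above. When a record ends with a separator, awk reports an empty final field:

```shell
# Hypothetical helper: print NF and each field, bracketed so empty fields are visible.
show_fields() {
    awk -F '[^[:alpha:]]+' '{
        print "NF=" NF
        for (i = 1; i <= NF; i++) printf "%d=[%s]\n", i, $i
    }'
}

printf 'This, that;\n' | show_fields
# NF=3
# 1=[This]
# 2=[that]
# 3=[]
```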

You can partially fix it by setting RS="" in the awk, which for this file reads everything as a single record (paragraph mode), but you still get one blank field from the punctuation at the end of that record.
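
A rough sketch of that RS="" behavior, assuming the two-line sample file from the question (the helper name count_empty is hypothetical). It reports how many fields the single paragraph-mode record has and how many of them are empty:

```shell
# Hypothetical helper: in paragraph mode (RS=""), report NF and the number of empty fields.
count_empty() {
    awk 'BEGIN { RS = ""; FS = "[^[:alpha:]]+" }
    {
        empty = 0
        for (i = 1; i <= NF; i++) if ($i == "") empty++
        print "NF=" NF, "empty=" empty
    }'
}

printf 'This, that;\nthis-that or this.\n' | count_empty
# NF=7 empty=1
```

The ";" plus newline in the middle is swallowed as one separator, so only the final "." leaves an empty field behind.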

I can also fix it this way:

awk 'BEGIN{FS="[^[:alpha:]]+"}
{ for (i=1;i<=NF;i++) if ($i) seen[tolower($i)]++ }
END {for (e in seen) printf "%4s %s\n", seen[e], e}' file

Which seems a little less than straightforward.

Is there an idiomatic fix I am missing to make POSIX awk act similarly to GNU awk's FPAT solution here?

dawg asked Nov 15 '21 16:11


5 Answers

This should work in POSIX/BSD or any version of awk:

awk -F '[^[:alpha:]]+' '
{for (i=1; i<=NF; ++i) ($i != "") && ++count[tolower($i)]}
END {for (e in count) printf "%4s %s\n", count[e], e}' file

   1 or
   3 this
   2 that
  • By using -F '[^[:alpha:]]+' we split fields on any run of non-alphabetic characters.
  • The ($i != "") condition makes sure that only non-empty fields are counted in count.
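
As a quick check of the edge cases, the sketch below feeds the same logic a line without trailing punctuation and a line with leading punctuation. The wrapper name count_words and the sample input are made up for illustration, and sort is added only to make the output order deterministic:

```shell
# Hypothetical wrapper around the command above; sort -k2 orders the result by word.
count_words() {
    awk -F '[^[:alpha:]]+' '
    { for (i = 1; i <= NF; ++i) if ($i != "") ++count[tolower($i)] }
    END { for (e in count) printf "%4s %s\n", count[e], e }' | sort -k2
}

# Line 1 has no trailing punctuation; line 2 starts and ends with punctuation.
printf 'This, that\n(this) or THAT.\n' | count_words
#    1 or
#    2 that
#    2 this
```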
anubhava answered Oct 17 '22 02:10


With POSIX awk, I'd use match and the builtin RSTART and RLENGTH variables:

# countwords.awk
{
    s = $0
    while (match(s, /[[:alpha:]]+/)) {
        word = substr(s, RSTART, RLENGTH)
        count[tolower(word)]++
        s = substr(s, RSTART+RLENGTH)
    }
}
END {
    for (word in count) print count[word], word
}

$ awk -f countwords.awk file
1 or
3 this
2 that

Works with the default BSD awk on my Mac.
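
To make the loop's mechanics concrete, here is a small trace of the same match/substr pattern on the second sample line. The helper name trace_match is hypothetical. Note that RSTART is relative to the remaining string s, not to the original line:

```shell
# Hypothetical helper: show RSTART, RLENGTH and the extracted word on each iteration.
trace_match() {
    awk '{
        s = $0
        while (match(s, /[[:alpha:]]+/)) {
            printf "RSTART=%d RLENGTH=%d word=%s\n", RSTART, RLENGTH, substr(s, RSTART, RLENGTH)
            s = substr(s, RSTART + RLENGTH)   # continue scanning after the matched word
        }
    }'
}

printf 'this-that or this.\n' | trace_match
# RSTART=1 RLENGTH=4 word=this
# RSTART=2 RLENGTH=4 word=that
# RSTART=2 RLENGTH=2 word=or
# RSTART=2 RLENGTH=4 word=this
```

The loop stops once only "." is left, because match() then returns 0.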

glenn jackman answered Oct 17 '22 02:10


With your shown samples, please try the following awk code. It was written and tested in GNU awk (RT is a gawk extension), in case you are OK with an RS-based approach.

awk -v RS='[[:alpha:]]+' '
RT{
  val[tolower(RT)]++
}
END{
  for(word in val){
    print val[word], word
  }
}
' Input_file

Explanation: RS is set to the regex [[:alpha:]]+, so every word acts as a record separator. GNU awk stores the text that matched RS in RT, so the main block runs whenever RT is non-empty and increments val[tolower(RT)], counting each word case-insensitively. The END block then traverses val and prints each count with its word.

RavinderSingh13 answered Oct 17 '22 02:10


Using RS instead:

$ gawk -v RS="[^[:alpha:]]+" '  # [^a-zA-Z] or something for some awks
$0 {                            # remove possible leading null string
    a[tolower($0)]++
}
END {
    for(i in a)
        print i,a[i]
}' file

Output:

this 3
or 1
that 2

Tested successfully on gawk and Mac awk (version 20200816), and on mawk and busybox awk using [^a-zA-Z].

James Brown answered Oct 17 '22 00:10


With GNU awk using patsplit() and a second array for counting, you can try this:

awk 'patsplit($0, a, /[[:alpha:]]+/) {for (i in a) b[ tolower(a[i]) ]++} END {for (j in b) print b[j], j}' file
3 this
1 or
2 that
Carlos Pascual answered Oct 17 '22 02:10