How to use multiple passes with gawk?

Tags:

awk

gawk

I'm trying to use gawk from Cygwin to process a CSV file. Pass 1 finds the max value, and pass 2 prints the records that match that max. I'm using a .awk file as input. When I use the pattern form from the manual, it matches on both passes. I can use the if form as a workaround, but that forces me to put an if inside every pattern's action, which is kind of a pain. Any idea what I'm doing wrong?

Here's my .awk file:

pass == 1
{
    print "pass1 is", pass;  
}    

pass == 2
{
if(pass == 2)
    print "pass2 is", pass;  
}    

Here's my output (the input file is just "hello"):

hello
pass1 is 1
pass1 is 2
hello
pass2 is 2

Here's my command line:

gawk -F , -f test.awk pass=1 x.txt pass=2 x.txt

I'd appreciate any help.

Asked Dec 08 '15 by Steve Kolokowsky

2 Answers

A (g)awk solution might look like this:

awk 'FNR == NR{print "1st pass"; next}
     {print "second pass"}' x.txt x.txt

(Replace awk with gawk if necessary.)

Let's say you want to find the maximum value in the first column of file x.txt and then print all lines that have this value in the first column. Your program might look like this (thanks to Ed Morton for a tip; see comments):

awk -F"," 'FNR==NR {max = ( (FNR==1) || ($1 > max) ? $1 : max ); next}
           $1==max'  x.txt x.txt

The output for x.txt:

6,5
2,6
5,7
6,9

is

6,5
6,9

How does this work? The variable NR keeps increasing with every record, whereas FNR is reset to 1 when reading a new file. Therefore, FNR==NR is only true for the first file processed.
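To make that concrete, here is a small shell sketch (the file name demo.txt is just for illustration) showing how NR keeps counting across files while FNR restarts:

```shell
# Two-line sample file, passed twice on the command line.
printf 'a\nb\n' > demo.txt

# NR counts records across all files; FNR restarts at 1 per file,
# so FNR==NR holds only while the first file is being read.
awk '{ print "NR=" NR, "FNR=" FNR, (FNR==NR ? "first pass" : "second pass") }' demo.txt demo.txt
# NR=1 FNR=1 first pass
# NR=2 FNR=2 first pass
# NR=3 FNR=1 second pass
# NR=4 FNR=2 second pass
```

One caveat worth knowing: if the first file happens to be empty, FNR==NR is also true at the start of the second file, since both counters are then 1.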

Answered Dec 26 '22 by F. Knorr

So... F.Knorr answered your question accurately and concisely, and he deserves a big green checkmark. NR==FNR is exactly the secret sauce you're looking for.

But here is a different approach, just in case the multi-pass thing proves to be problematic. (Perhaps you're reading the file from a slow drive, a USB stick, across a network, DAT tape, etc.)

awk -F, '$1>m{delete l;n=0;m=$1}m==$1{l[++n]=$0}END{for(i=1;i<=n;i++)print l[i]}' inputfile

Or, spaced out for easier reading:

BEGIN {
  FS=","
}

$1 > max {
  delete list           # empty the array
  n=0                   # reset the array counter
  max=$1                # set a new max
}

max==$1 {
  list[++n]=$0          # record the line in our array
}

END {
  for(i=1;i<=n;i++) {   # print the array in order of found lines.
    print list[i]
  }
}
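If you prefer to keep the program in a file and run it with -f, as in the question's command line, a sketch would look like this (the names maxlines.awk and inputfile are just illustrative):

```shell
# Sample input matching the data used in the other answer.
printf '6,5\n2,6\n5,7\n6,9\n' > inputfile

# Save the spaced-out program as maxlines.awk ...
cat > maxlines.awk <<'EOF'
BEGIN { FS="," }
$1 > max  { delete list; n=0; max=$1 }   # new maximum: restart collection
max == $1 { list[++n]=$0 }               # remember lines matching the max
END       { for (i=1; i<=n; i++) print list[i] }
EOF

# ... and run it in a single pass over the input:
awk -f maxlines.awk inputfile
# 6,5
# 6,9
```

Note that max starts out uninitialized, which compares as 0 in the numeric comparison $1 > max, so this sketch assumes the first column is non-negative; the FNR==1 initialization in the other answer avoids that assumption.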

With the same input data that F.Knorr tested with, I get the same results.

The idea here is to go through the file in ONE pass. We record every line that matches our max in an array, and if we come across a value that exceeds the max, we clear the array and start collecting lines afresh.

This approach is heavier on CPU and memory (depending on the size of your dataset), but being single pass, it is likely to be lighter on I/O.

Answered Dec 26 '22 by ghoti