
Re-evaluating fields in a record after awk field separator change

Tags: macos, awk

(This is my first post here, so please forgive me if I am asking the question the wrong way.)

I am learning awk on OS X Mavericks. I am going through this tutorial on awk.

I am trying to reproduce something similar to the awk_example4a.awk in that tutorial.

So I came up with this awk program/script/arguments (not sure what you call it??):

BEGIN { i=1 }
{
    print "Line " i;
    print "$1 is " $1,"\n$2 is " $2, "\n$3 is " $3;
    FS=":";
    $0=$0;
    print "With the new FS - line " i;
    print "$1 is " $1,"\n$2 is " $2, "\n$3 is " $3;
    FS=" ";
    i++;
}

And the input file looks like this:

A1 B1:B2 C2
A1:A2 B2:B3 C3

What I am trying to do is to process each line/record first with the default FS (whitespace), and then re-process the same with a new FS (":"), then restore the default FS before going to the next record.

According to the tutorial, $0=$0 is supposed to get awk to re-evaluate the fields using the new field separator, and thus supposedly giving me an output that looks like this:

Line 1
$1 is A1 
$2 is B1:B2 
$3 is C2
With the new FS - line 1
$1 is A1 B1
$2 is B2 C2
$3 is
Line 2
$1 is A1:A2 
$2 is B2:B3 
$3 is C3
With the new FS - line 2
$1 is A1
$2 is A2 B2
$3 is B3 C3

But instead, I get:

Line 1
$1 is A1 
$2 is B1:B2 
$3 is C2
With the new FS - line 1
$1 is A1 
$2 is B1:B2 
$3 is C2
Line 2
$1 is A1:A2 
$2 is B2:B3 
$3 is C3
With the new FS - line 2
$1 is A1:A2 
$2 is B2:B3 
$3 is C3

i.e. the fields have not been re-evaluated after the FS was changed.

So if $0=$0 doesn't work (and nor do things like $1=$1; $2=$2), how do I get awk to re-evaluate the same line using a different FS?

Thank you.

LD99 asked Nov 01 '22 20:11



1 Answer

tl;dr:

FreeBSD/OS X awk doesn't apply a change to FS (the field separator) to the current record, even when $0 is reassigned; the new value takes effect only from the next record read. This behavior is arguably POSIX-mandated (see below).

Workaround: Don't change FS; use the split() function instead:

{
    print "Line " ++i
    print "$1 is " $1 "\n$2 is " $2 "\n$3 is " $3
    split($0, flds, ":")   # split current line by ':' into array `flds`
    print "With the new FS - line " i
    print "field1 is " flds[1] "\nfield2 is " flds[2] "\nfield3 is " flds[3]
}
  • Note how the BEGIN block was eliminated by relying on uninitialized variables defaulting to 0 in numeric contexts.
  • The commas (,) were removed from the print statements, because each one inserts a space (the default value of the output field separator, OFS) between its operands, which is not wanted here.
  • Given that the statements are newline-separated, ; is not needed to terminate them.
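
If the number of ':'-separated fields isn't known in advance, the return value of split() can stand in for NF. A minimal sketch, wrapped in a shell function so it can be pasted into a terminal; the function name resplit_demo and the variables n and j are made up for illustration, and the input is the question's two sample lines:

```shell
# Hypothetical wrapper: feeds the question's sample input to awk.
resplit_demo() {
  printf '%s\n' 'A1 B1:B2 C2' 'A1:A2 B2:B3 C3' |
  awk '{
    n = split($0, flds, ":")   # n plays the role of NF for the ":" split
    print "Line " NR ": " n " colon-separated fields"
    for (j = 1; j <= n; j++)
      print "  field" j " is " flds[j]
  }'
}
resplit_demo
```

For the first sample line this reports 2 fields (A1 B1 and B2 C2), for the second 3. FS itself is never touched, so $1, $2, ... keep their whitespace-split values throughout.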

Read on for the fun multi-platform compatibility details.


The POSIX spec. for awk states (emphasis mine):

Before the first reference to a field in the record is evaluated, the record shall be 
split into fields, according to the rules in Regular Expressions, 
**using the value of FS that was current at the time the record was read**.

With respect to assigning a new value to $0 or a specific field, the same source states:

The symbol $0 shall refer to the entire record; setting any other field causes 
the re-evaluation of $0. Assigning to $0 shall reset the values of all other
fields and the NF built-in variable.

In other words: since the passage on re-assignment doesn't say otherwise, the only statement in the POSIX spec. about the scope of a given FS value requires that value to stay constant for a given input record. The spec. is admittedly ambiguous here, and it would help if it resolved the question explicitly. That said, the conservative and thus safer interpretation is to assume that FS stays constant while a given record is being processed.
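
The uncontroversial part of that rule (the FS in effect when a record was read governs its initial splitting) can be seen with a short experiment. This is only a sketch; the function name fs_timing_demo and the two-line input are invented for illustration:

```shell
# Hypothetical demo: FS is changed while record 1 is being processed,
# without re-assigning $0.
fs_timing_demo() {
  printf '%s\n' 'a:b' 'c:d' |
  awk '{ FS = ":"; print NR ": $1 is " $1 }'
}
fs_timing_demo
```

An awk that follows the quoted passage should print "1: $1 is a:b" followed by "2: $1 is c": record 1 was read while FS still had its default value, so the new FS only shows up from record 2 onward.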

As such, it is FreeBSD/OS X awk that is the model citizen, whereas GNU awk and also mawk offer more flexibility by NOT playing by the rules and applying FS changes even to the current record on re-assigning to $0 or any specific field.

Note, however, that GNU awk (as of v4.1.1) doesn't change that behavior even with the --posix option, whose express purpose is to enforce POSIX-compliant behavior. If I'm reading the POSIX spec. correctly (do tell me if I'm not), this should be considered a bug.
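
To find out which of the two behaviors a particular awk exhibits, a one-record probe should be enough. A rough sketch (fs_probe is a name made up here); going by the description above, GNU awk and mawk would be expected to yield a, and FreeBSD/OS X awk a:b:

```shell
# Hypothetical probe: change FS, re-assign $0, then look at $1.
fs_probe() {
  printf '%s\n' 'a:b' | awk '{ FS = ":"; $0 = $0; print $1 }'
}

case "$(fs_probe)" in
  a)   echo 'FS change was applied to the current record' ;;
  a:b) echo 'FS change is deferred until the next record' ;;
  *)   echo 'unexpected result' ;;
esac
```

A script that has to run on both kinds of awk is better off not depending on either outcome and using split() as shown in the workaround.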

mklement0 answered Nov 15 '22 07:11
