In AWK, why does a nonexistent field like $(NF+1) not equal zero?

Tags:

awk

When using AWK, I'm struggling to understand why a nonexistent field (a field after $NF) does not compare equal to numeric zero.

In the example below, the input line has two fields, so according to the spec $3 should be an "uninitialized value" and compare equal to 0. In other words, $3 == 0 should return true, but as you can see below it returns false:

$ echo '1 2' | awk '{ print($3 == 0 ? "t" : "f") }'
f

Both "One True AWK" (version 20121220) and GNU AWK (version 4.2.1) behave the same way. Here's the GNU AWK output:

$ echo '1 2' | gawk '{ print($3 == 0 ? "t" : "f") }'
f

According to the POSIX AWK spec, nonexistent fields like $3 should be uninitialized values:

References to nonexistent fields (that is, fields after $NF), shall evaluate to the uninitialized value.

Additionally, comparisons like == should be made numerically if one operand is numeric and the other is an uninitialized value:

Comparisons (with the '<', "<=", "!=", "==", '>', and ">=" operators) shall be made numerically if both operands are numeric, if one is numeric and the other has a string value that is a numeric string, or if one is numeric and the other has the uninitialized value. Otherwise, operands shall be converted to strings as required...

And finally, an uninitialized value's "numeric value" should be zero:

An uninitialized value shall have both a numeric value of zero and a string value of the empty string.

Contrast this to an uninitialized variable, which does compare equal to zero:

$ awk 'BEGIN { print(x == 0 ? "t" : "f") }'
t

So in our first example, $3 should be an uninitialized value, == should compare it numerically, and its numeric value should be zero. Hence it seems to me that $3 == 0 ? "t" : "f" should output t instead of f.

Can anyone help me understand why it doesn't, or help me see how I'm misreading the spec?

asked Aug 01 '18 12:08 by Ben Hoyt


2 Answers

There is an interesting passage in The AWK Programming Language by Alfred V. Aho, Brian W. Kernighan and Peter J. Weinberger (1988):

Uninitialized variables are created with the numeric value 0 and the string value "". Nonexistent fields and fields that are explicitly null have only the string value ""; they are not numeric, but when coerced to numbers they acquire the numeric value 0.

source: The AWK Programming Language, section 2.2, p 45

Furthermore:

Uninitialized variables have the numeric value 0 and the string value "". Accordingly, if x is uninitialized,

if (x) ...

is false, and

if (!x) ...
if (x == 0) ...
if (x == "") ...

are all true. But note that

if (x == "0") ...

is false.

The type of a field is determined by context when possible; for example, $1++ implies that $1 must be coerced to numeric if necessary, and $1 = $1 "," $2 implies that $1 and $2 will be coerced to strings if necessary.

In contexts where types cannot be reliably determined, e.g.,

if ($1 == $2) ...

the type of each field is determined on input. All fields are strings; in addition, each field that contains only a number is also considered numeric. Fields that are explicitly null have the string value ""; they are not numeric. Nonexistent fields (i.e., fields past NF) and $0 for blank lines are treated this way too.

As it is for fields, so it is for array elements created by split.

source: The AWK Programming Language, Appendix A, Initialization, comparison, and type coercion, p 192
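The rules quoted above can be modelled in a few lines. This is only a sketch in Python, not code from the book: the Value class and the equal helper are hypothetical names, and it assumes that a comparison is numeric only when both operands are numeric and is a string comparison otherwise.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Value:
    text: str       # string value
    numeric: bool   # True if the value also counts as a number


# Hypothetical stand-ins for the cases described in the book.
UNINIT_VAR = Value("", True)       # uninitialized variable: 0 and ""
MISSING_FIELD = Value("", False)   # nonexistent or null field: "" only
NUM_ZERO = Value("0", True)        # the numeric constant 0
STR_EMPTY = Value("", False)       # the string constant ""
STR_ZERO = Value("0", False)       # the string constant "0"


def to_number(v: Value) -> float:
    # Simplified coercion: the empty string becomes 0.
    return float(v.text) if v.text else 0.0


def equal(a: Value, b: Value) -> bool:
    # Assumption: numeric comparison only if both sides are numeric.
    if a.numeric and b.numeric:
        return to_number(a) == to_number(b)
    return a.text == b.text


print(equal(UNINIT_VAR, NUM_ZERO))     # models x == 0
print(equal(UNINIT_VAR, STR_EMPTY))    # models x == ""
print(equal(UNINIT_VAR, STR_ZERO))     # models x == "0"
print(equal(MISSING_FIELD, NUM_ZERO))  # models $3 == 0
```

Under these assumptions the four calls print True, True, False and False: the uninitialized variable equals 0, but the nonexistent field falls through to a string comparison of "" with "0".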

In my opinion, these lines explain nicely the observed behavior and it seems that most programs follow this too.


In addition, as an addendum to rici's answer:

When investigating the source code of GNU Awk 4.2.1, I found that:

  • Uninitialized variables are assigned the Node named Nnull_string, which has the flags:

    main.c: Nnull_string->flags = (MALLOC|STRCUR|STRING|NUMCUR|NUMBER);
    
  • Nonexistent fields are assigned the Node named Null_field, which is a copy of Nnull_string with its flags redefined:

    field.c: *Null_field = *Nnull_string;
    field.c: Null_field->valref = 1;
    field.c: Null_field->flags = (STRCUR|STRING|NULL_FIELD); /* do not set MALLOC */
    

where the flags have the following values (from awk.h):

#       define  STRING  0x0002       /* assigned as string */
#       define  STRCUR  0x0004       /* string value is current */
#       define  NUMCUR  0x0008       /* numeric value is current */
#       define  NUMBER  0x0010       /* assigned as number */
#       define  NULL_FIELD 0x2000    /* this is the null field */

The comparison function int cmp_nodes(NODE *t1, NODE *t2, bool use_strcmp) defined in eval.c, just checks if the NUMBER flag is set in both t1 and t2:

if ((t1->flags & NUMBER) != 0 && (t2->flags & NUMBER) != 0)
    return cmp_numbers(t1, t2);

As Null_field does not have the NUMBER flag set, the comparison treats it as a string. This all seems to be in line with what the book says!
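The flag check can be replayed outside gawk. The following is a minimal Python sketch, not gawk code: it copies the flag values from the awk.h excerpt above and mirrors only the quoted condition from cmp_nodes. The names NNULL_STRING_FLAGS, NULL_FIELD_FLAGS, NUMBER_FLAGS and compares_numerically are made up for illustration, and anything else cmp_nodes may do before this check is ignored.

```python
# Flag values copied from the awk.h excerpt.
STRING = 0x0002
STRCUR = 0x0004
NUMCUR = 0x0008
NUMBER = 0x0010
NULL_FIELD = 0x2000

# MALLOC is left out: its value is not shown in the excerpt and it
# plays no part in the NUMBER check.
NNULL_STRING_FLAGS = STRCUR | STRING | NUMCUR | NUMBER   # uninitialized variable
NULL_FIELD_FLAGS = STRCUR | STRING | NULL_FIELD          # nonexistent field

# Assumption: a numeric constant such as 0 carries NUMBER | NUMCUR,
# like the "a = 5" case in the awk.h comment.
NUMBER_FLAGS = NUMBER | NUMCUR


def compares_numerically(flags1: int, flags2: int) -> bool:
    # Mirrors the quoted condition: both operands need the NUMBER bit.
    return (flags1 & NUMBER) != 0 and (flags2 & NUMBER) != 0


print(compares_numerically(NNULL_STRING_FLAGS, NUMBER_FLAGS))  # x == 0
print(compares_numerically(NULL_FIELD_FLAGS, NUMBER_FLAGS))    # $3 == 0
```

This prints True for the uninitialized variable and False for the nonexistent field, so only the former reaches cmp_numbers.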

Furthermore, from awk.h:

* STRING and NUMBER are mutually exclusive, except for the special
* case of an uninitialized value, represented internally by
* Nnull_string. They represent the type of a value as assigned.
* Nnull_string has both STRING and NUMBER attributes, but all other
* scalar values should have precisely one of these bits set.
*
* STRCUR and NUMCUR are not mutually exclusive. They represent that
* the particular type of value is up to date.  For example,
*
*   a = 5       # NUMBER | NUMCUR
*   b = a ""    # Adds STRCUR to a, since a string value
*               # is now available. But the type hasn't changed!
*
*   a = "42"    # STRING | STRCUR
*   b = a + 0   # Adds NUMCUR to a, since numeric value
*               # is now available. But the type hasn't changed!
answered Sep 27 '22 20:09 by kvantour


As far as I can see, you're reading the Posix spec correctly. The Posix spec is based on The AWK Programming Language (which is included as an informative reference) but seeks to make certain aspects of the language more precise. In particular, previous practices for dealing with string values and number values led to some curious consequences, some of which are noted in the Rationale section of the Posix utility description. The opinion of the Posix authors is that "[t]he behavior of historical implementations was seen as too unintuitive and unpredictable," and looking at one of the examples, it is hard to disagree:

$ seq 1 4 | nawk '{
>     a = "+2"
>     b = 2
>     if (NR % 2)
>         c = a + b
>     if (a == b)
>         print "numeric comparison"
>     else
>         print "string comparison"
> }
> '
numeric comparison
string comparison
numeric comparison
string comparison

The precise handling of empty and unspecified field values is one of the differences between the Posix spec and the awk language defined by The AWK Programming Language. So in the end, you will have to decide which specification you consider definitive.

As you note, Posix says clearly (in "Variables and special values") that:

References to nonexistent fields (that is, fields after $NF), shall evaluate to the uninitialized value.…

In fact, it's not just invalid fields which receive this treatment. Although empty strings are not "numeric strings" as defined by Posix [Note 1], an exception is made for empty fields (which are possible if you explicitly set the field separator):

Each field variable shall have a string value or an uninitialized value when created. Field variables shall have the uninitialized value when created from $0 using FS and the variable does not contain any characters.

Comparison operators are numeric if one argument is a number and the other is a number, a "numeric string" or an uninitialized value: (Expressions in awk, emphasis added):

Comparisons (with the '<', "<=", "!=", "==", '>', and ">=" operators) shall be made numerically if both operands are numeric, if one is numeric and the other has a string value that is a numeric string, or if one is numeric and the other has the uninitialized value. Otherwise, operands shall be converted to strings as required and a string comparison shall be made…
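The quoted sentence amounts to a small decision rule. Here is a hedged Python sketch of it: the Kind enum and posix_comparison function are hypothetical names, and the model encodes only the sentence quoted above, not the rest of the Posix text.

```python
from enum import Enum


class Kind(Enum):
    NUMBER = "number"
    STRING = "string"
    NUMERIC_STRING = "numeric string"
    UNINIT = "uninitialized"


def posix_comparison(a: Kind, b: Kind) -> str:
    """Return "numeric" or "string" according to the quoted sentence."""
    numeric_partners = (Kind.NUMBER, Kind.NUMERIC_STRING, Kind.UNINIT)
    # One operand must be a number; the other may be a number,
    # a numeric string, or the uninitialized value.
    for x, y in ((a, b), (b, a)):
        if x is Kind.NUMBER and y in numeric_partners:
            return "numeric"
    return "string"


# $3 == 0 with $3 past NF, where Posix makes $3 uninitialized.
print(posix_comparison(Kind.UNINIT, Kind.NUMBER))
# The same comparison if $3 is instead treated as a plain empty string.
print(posix_comparison(Kind.STRING, Kind.NUMBER))
```

The first call prints "numeric" and the second prints "string", which is the whole difference between the two readings of a nonexistent field.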

However, that is not the Gnu awk implementation, and it is apparently not the implementation in many other awks. Common implementations:

  • Treat empty and invalid fields as the empty string (which is not a numeric string) rather than an uninitialized value; and

  • Compare two "numeric strings" using numeric comparison, not string comparison.

I can't find an archive of an awk mailing list that goes back far enough in time, and the source history on Savannah only goes back to 2006 or so, but the Changelog includes the following entry from 1997:

Sun Jan 19 23:37:03 1997 Arnold D. Robbins

* field.c (get_field): Add new var that is like Nnull_string but
  does not have numeric attributes, so that new fields are strings.

And the code still reflects that decision. (Nnull_string is gawk's uninitialized value. The variable referred to is now the global Null_field.)

Interestingly, in a BEGIN rule, gawk (correctly) treats $0 as uninitialized rather than empty:

$ gawk 'BEGIN{print $0 == 0, $1 == 0}'
1 0

Notes

  1. A "numeric string" is a string originating from user input whose form is that of a number. This does not include quoted literals in an awk program; "1" is a string, not a numeric string. The possible origins of a numeric string are listed in the "Expressions in awk" section referenced above; they include fields, environment variables and command-line options, and the attribute is preserved by assignment.

    Having the form of a number is also defined in that section, where implementations are given two options:

    • Use the equivalent of strtod, with the additional constraint that the number parsed must consist of at least one character and that all trailing characters be whitespace;

    • Use the lexical definition of NUMBER from the awk grammar.

    Neither of these possibilities allows an empty string to be a numeric string.
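To see why the empty string fails the test, here is a rough Python approximation of the first option. It is a sketch under stated assumptions, not the Posix definition: looks_numeric and _NUMBER_FORM are hypothetical names, and the regular expression covers only plain decimal forms, whereas a real strtod may also accept forms such as hexadecimal, "inf" and "nan".

```python
import re

# Optional sign, then digits with an optional fraction (or a fraction
# with a leading dot), then an optional exponent. Whitespace is allowed
# on either side, but at least one digit is required.
_NUMBER_FORM = re.compile(
    r"\s*[+-]?(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:[eE][+-]?[0-9]+)?\s*"
)


def looks_numeric(text: str) -> bool:
    # The whole string must match, so trailing junk disqualifies it.
    return _NUMBER_FORM.fullmatch(text) is not None


for sample in ["", "   ", "42", " +2 ", "1e3", "1x"]:
    print(repr(sample), looks_numeric(sample))
```

Only "42", " +2 " and "1e3" are reported as numeric. The empty and all-blank strings are rejected because no digit is present.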

answered Sep 27 '22 20:09 by rici