
hexadecimal literals in awk patterns

Tags: unix, macos, awk

awk is capable of parsing fields as hexadecimal numbers:

$ echo "0x14" | awk '{print $1+1}'
21 <-- correct, since 0x14 == 20

However, it does not seem to handle hexadecimal literals in patterns:

$ echo "0x14" | awk '$1+1<=21 {print $1+1}' | wc -l
1 <-- correct
$ echo "0x14" | awk '$1+1<=0x15 {print $1+1}' | wc -l
0 <-- incorrect.  awk is not properly handling the 0x15 here

Is there a workaround?

SheetJS asked Mar 22 '23 23:03


1 Answer

You're dealing with two similar but distinct issues here: non-decimal data in awk input, and non-decimal literals in your awk program.

See the POSIX-1.2004 awk specification, Lexical Conventions:

8. The token NUMBER shall represent a numeric constant. Its form and numeric value [...]
   with the following exceptions:
    a. An integer constant cannot begin with 0x or include the hexadecimal digits 'a', [...]

So awk (presumably you're using nawk or mawk) behaves "correctly". gawk (since version 3.1) supports non-decimal (octal and hex) literal numbers by default, though using the --posix switch turns that off, as expected.
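If you are not sure which behaviour your awk has, a quick probe can tell you. This is only a sketch: awk_has_hex_literals is a name made up for illustration, and it only checks whether 0x15 in program text compares equal to 21. In an awk that follows the 2004 lexical rules, 0x15 is read as 0 concatenated with an unset variable x15, so the comparison fails.

```shell
# Hypothetical helper (not from the original answer): succeeds only if
# the awk on PATH treats 0x15 in program text as the number 21.
awk_has_hex_literals() {
  # exit status 0 when the comparison is true, 1 otherwise
  awk 'BEGIN { exit !(0x15 == 21) }'
}

if awk_has_hex_literals; then
  echo "hex literals: supported"
else
  echo "hex literals: not supported"
fi
```

The result depends entirely on which awk is installed, so treat it as a diagnostic rather than a guarantee about input data handling.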

The normal workaround in cases like this is to rely on the defined numeric string behaviour: a numeric string is effectively parsed as if by the C standard atof() or strtod() function, which supports 0x-prefixed numbers:

$ echo "0x14" | nawk '$1+1<=0x15 {print $1+1}'
<no output>
$ echo "0x14" | nawk '$1+1<=("0x15"+0) {print $1+1}'
21

The problem here is that this isn't strictly correct, as POSIX-1.2004 also states:

A string value shall be considered a numeric string if it comes from one of the following: 
   1. Field variables
   ...
and after all the following conversions have been applied, the resulting string would 
lexically be recognized as a NUMBER token as described by the lexical conventions in Grammar

UPDATE: gawk aims for "2008 POSIX.1003.1". Note, however, that since the 2008 edition (see the IEEE Std 1003.1 2013 edition awk here) the standard allows strtod() and implementation-dependent behaviour that does not require the number to conform to the lexical conventions. This should (implicitly) support INF and NAN too. The text in Lexical Conventions is similarly amended to optionally allow hexadecimal constants with 0x prefixes.

This won't behave (given the lexical constraint on numbers) quite as hoped in gawk:

$ echo "0x14" | gawk  '$1+1<=0x15 {print $1+1}'
1

(note the "wrong" numeric answer: the field 0x14 is read as 0, so $1+1 is 1, which would have been hidden by | wc -l) unless you use --non-decimal-data too:

$ echo "0x14" | gawk --non-decimal-data '$1+1<=0x15 {print $1+1}'
21

See also:

  • https://www.gnu.org/software/gawk/manual/html_node/Nondecimal_002dnumbers.html
  • http://www.gnu.org/software/gawk/manual/html_node/Variable-Typing.html

The accepted answer to this SE question has a portability workaround.

The options for having the two types of support for non-decimal numbers are:

  • use only gawk, without --posix and with --non-decimal-data
  • implement a wrapper function to perform hex-to-decimal, and use this both with your literals and on input data

If you search for "awk dec2hex" you can find many instances of the latter; a passable one is here: http://www.tek-tips.com/viewthread.cfm?qid=1352504 . If you want something like gawk's strtonum(), you can get a portable awk-only version here.
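As a rough illustration of the wrapper approach, here is a minimal sketch that applies the same conversion to both the input field and the literal. The names hex_le and hex2dec are invented for this example (they are not taken from the linked pages), and the function only handles unsigned integers with a 0x or 0X prefix; anything else falls back to awk's ordinary string-to-number conversion.

```shell
# hex_le and hex2dec are hypothetical names used for illustration only.
hex_le() {
  awk '
  # Convert "0x14" / "0X1F" style strings to a number without relying on
  # hex literal or strtod() support in the awk implementation.
  function hex2dec(s,    n, i, d) {
      # Not a 0x-prefixed hex integer: use the normal numeric conversion.
      if (s !~ /^0[xX][0-9a-fA-F]+$/) return s + 0
      s = tolower(substr(s, 3))      # drop the 0x prefix
      n = 0
      for (i = 1; i <= length(s); i++) {
          d = index("0123456789abcdef", substr(s, i, 1)) - 1
          n = n * 16 + d
      }
      return n
  }
  # Same test as in the question, with the literal passed as a string.
  hex2dec($1) + 1 <= hex2dec("0x15") { print hex2dec($1) + 1 }
  '
}

echo "0x14" | hex_le
```

Because the literal is written as the string "0x15" and converted by hand, this should not depend on --non-decimal-data or on how a given awk tokenizes 0x15.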

mr.spuratic answered Apr 01 '23 01:04