awk: make it less system dependent

Tags: c, bash, awk

If I'm not mistaken, awk parses a number depending on the OS language (e.g., echo "1,2" | awk '{printf("%f\n",$1)}' would be interpreted as 1 on an English system and as 1.2 on a system where a comma separates the integer part from the decimal part).

I don't know if the C printf does this too, so I added the C tag.

I would like to modify the previous command so that it returns the same value (1.2) regardless of the system being used.

bob asked Apr 25 '12 17:04


2 Answers

Welcome to the ugliness of locale. To fix your problem, first set the locale to the C one.

export LC_NUMERIC=C
echo "1,2" | awk '...your code...'

To turn off other locale-dependent tomfoolery as well, you can set LC_ALL instead:

export LC_ALL=C
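
As a variation on the above, the setting can be scoped to a single command instead of being exported for the whole shell. This is only a sketch. LC_ALL takes precedence over LC_NUMERIC, so if LC_ALL is already set in the environment, exporting LC_NUMERIC alone has no effect. Note also that in the C locale the comma is not a decimal separator, so the input 1,2 is read as 1 on every system.

```shell
# Per-command locale override; nothing is exported to the rest of the shell.
# In the C locale, number parsing stops at the comma, so "1,2" becomes 1.
echo "1,2" | LC_ALL=C awk '{ printf("%f\n", $1) }'   # prints 1.000000
```

The result is at least the same everywhere, which is the system independence asked for, even though it is not yet the value 1.2.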
Dave answered Sep 22 '22 01:09


If you're using gawk, you can use the --use-lc-numeric option.

$ LC_NUMERIC=de_DE.UTF-8 awk 'BEGIN {printf("%f\n", "1,2")}'
1.000000
$ LC_NUMERIC=de_DE.UTF-8 awk --use-lc-numeric 'BEGIN {printf("%f\n", "1,2")}'
1,200000

From the GAWK manual

The POSIX standard says that awk always uses the period as the decimal point when reading the awk program source code, and for command-line variable assignments (see Other Arguments). However, when interpreting input data, for print and printf output, and for number to string conversion, the local decimal point character is used. Here are some examples indicating the difference in behavior, on a GNU/Linux system:

 $ gawk 'BEGIN { printf "%g\n", 3.1415927 }'
 -| 3.14159
 $ LC_ALL=en_DK gawk 'BEGIN { printf "%g\n", 3.1415927 }'
 -| 3,14159
 $ echo 4,321 | gawk '{ print $1 + 1 }'
 -| 5
 $ echo 4,321 | LC_ALL=en_DK gawk '{ print $1 + 1 }'
 -| 5,321

The ‘en_DK’ locale is for English in Denmark, where the comma acts as the decimal point separator. In the normal "C" locale, gawk treats ‘4,321’ as ‘4’, while in the Danish locale, it's treated as the full number, 4.321.

Some earlier versions of gawk fully complied with this aspect of the standard. However, many users in non-English locales complained about this behavior, since their data used a period as the decimal point, so the default behavior was restored to use a period as the decimal point character. You can use the --use-lc-numeric option (see Options) to force gawk to use the locale's decimal point character. (gawk also uses the locale's decimal point character when in POSIX mode, either via --posix, or the POSIXLY_CORRECT environment variable.)

I get similar behavior from /usr/bin/printf

$ LC_NUMERIC=de_DE.UTF-8 /usr/bin/printf "%f\n" "1,2"
/usr/bin/printf: 1,2: value not completely converted
1,000000
$ LC_NUMERIC=de_DE.UTF-8 /usr/bin/printf "%f\n" "1.2"
1,200000

But without the ability to override it.

If your intent is to do the opposite, that is, to take "European" input and output "US" numbers, you're going to need to use something more robust, possibly Python or Perl with their locale modules.
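
For the simple case in the question (a single decimal comma, no thousands separators), one possible sketch stays in awk: replace the comma with a period before the numeric conversion, and run awk in the C locale so the output uses a period too. The function name to_c_number is hypothetical. Grouped input such as 1.234,56 would need more careful handling than this.

```shell
# to_c_number is a hypothetical helper, not part of awk or gawk.
# Assumes at most one decimal comma and no thousands separators.
to_c_number() {
    printf '%s\n' "$1" |
        LC_ALL=C awk '{ sub(/,/, ".", $1); printf("%f\n", $1) }'
}

to_c_number "1,2"   # prints 1.200000
```

Because the comma is rewritten as text before awk converts the field to a number, the result does not depend on the caller's locale.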

Dennis Williamson answered Sep 20 '22 01:09