
Why does AWK not treat this array index as a number unless I use int()?

Tags: arrays, bash, awk

I have genomics files of the following type:

$ cat test-file_long.txt 
2 41647 A G
2 45895 A G
2 45953 T C
2 224919 A G
2 230055 C G
2 233239 A G
2 234130 T G
2 23454 T C

When I use the following short AWK script, it does not return all of the elements that are greater than the value used in the if statement:

{
    a[$2]
}
END {
    for (i in a) {
        if (i > 45895)
            print i
    }
}

The script returns this:

$ awk -f practice.awk test-file_long.txt 
45953

However, when I change the if statement to use int(), it returns all of the values that are in fact greater, as I want:

{
    a[$2]
}
END {
    for (i in a) {
        if (int(i) > 45895)
            print i
    }
}

Result:

$ awk -f practice.awk test-file_long.txt 
233239
230055
234130
224919
45953

It appears to compare only the first digit and, if those are equal, move on to the next digit, rather than comparing the whole number. Can someone explain what it is about the internal mechanism of the associative array that prevents the numeric >/< comparison unless I take the int() of the array element? What if my array elements were floats and int() was not an option?

isosceleswheel asked Apr 24 '14 15:04



1 Answer

Array keys in awk are always strings, so i > 45895 is a string (lexicographic) comparison, carried out character by character. In your first example, "45953" passes the test because "459" sorts after "458", while "224919" fails because "2" sorts before "4".
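To see the two kinds of comparison side by side, here is a minimal sketch using a single made-up input line (the variable names as_string and as_number are only for illustration). The loop variable holds the key as a string, so the first test compares text and the second, after adding zero, compares numbers:

```shell
# Hypothetical one-line input; the key "224919" comes back from the array as a string.
result=$(printf '2 224919 A G\n' | awk '
{ a[$2] }
END {
    for (i in a) {
        as_string = (i > 45895)      # string comparison: "2" sorts before "4", so 0
        as_number = (i + 0 > 45895)  # numeric comparison: 224919 > 45895, so 1
        print as_string, as_number
    }
}')
echo "$result"
```

This should print 0 1, matching the behaviour described in the question.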

If your only goal is to print the lines whose 2nd column is > 45895 numerically, this would do:

awk '$2 > 45895' test-file_long.txt
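No conversion is needed in that one-liner because a field read from input that looks like a number is compared numerically against a numeric constant. As a rough check, the sketch below feeds the sample data from the question through the same pattern (the here-document just stands in for test-file_long.txt):

```shell
# The here-document reproduces the sample file from the question.
result=$(awk '$2 > 45895' <<'EOF'
2 41647 A G
2 45895 A G
2 45953 T C
2 224919 A G
2 230055 C G
2 233239 A G
2 234130 T G
2 23454 T C
EOF
)
echo "$result"
```

The five lines with positions 45953, 224919, 230055, 233239 and 234130 should come out, in file order.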

Variables change type depending on the context in which they are evaluated, so putting a variable in an explicitly numeric context makes awk treat it as a number. @glenn's suggestion of i+0 demonstrates this perfectly. Unlike int(), adding zero does not truncate, so it also works when the keys are floating-point values.

Alternatively, the unary plus operator +i can be used to convert an expression to a number, although i+0 is the more commonly seen idiom. So your longer example could be changed to:

awk '{a[$2]} END { for (i in a) { if (+i > 45895) print i } }' test-file_long.txt
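On the question of floats: the sketch below uses i + 0 rather than int(), with made-up floating-point values and an arbitrary threshold of 3.5, both chosen only for illustration. The output is piped through sort -n because the order of a for (i in a) loop is not specified:

```shell
# Hypothetical float data; i + 0 converts each string key to a number without truncating.
result=$(printf 'x 0.5\nx 12.25\nx 3.75\n' | awk '
{ a[$2] }
END {
    for (i in a)
        if (i + 0 > 3.5)   # numeric comparison; as strings, "12.25" would sort before "3.5"
            print i
}' | sort -n)
echo "$result"
```

This should print 3.75 and 12.25. With a plain i > 3.5, the key 12.25 would be missed for the same reason as in the original script.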
Tom Fenech answered Oct 13 '22 13:10
