I have genomics files of the following type:
$ cat test-file_long.txt
2 41647 A G
2 45895 A G
2 45953 T C
2 224919 A G
2 230055 C G
2 233239 A G
2 234130 T G
2 23454 T C
When I use the following short AWK script, it does not return all of the elements which are greater than the element used in the if statement:
{
    a[$2]
}
END {
    for (i in a) {
        if (i > 45895)
            print i
    }
}
The script returns this:
$ awk -f practice.awk test-file_long.txt
45953
However, when I change the if statement to use int(), it returns all of the values that are in fact greater, as I want:
{
    a[$2]
}
END {
    for (i in a) {
        if (int(i) > 45895)
            print i
    }
}
Result:
$ awk -f practice.awk test-file_long.txt
233239
230055
234130
224919
45953
It appears it is only making the comparison with the first digit, and if they are the same it looks at the next digit, but it does not process the whole number. Can someone explain to me what it is about the internal mechanism of the associative array that it does not make the numeric >/< comparison unless I specify that I want the int() of the array element? What if my array elements were floats and int() was not an option?
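Array keys in awk are strings, so a string (character-by-character) comparison is being done here rather than a numeric one. In your first example, 45953 passes the test because its leading 459 is greater than the 458 of 45895 alphabetically. The six-digit values all start with 2, which sorts before 4, so they fail the test even though they are numerically larger.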
If your only goal is to print the lines whose 2nd column is numerically greater than 45895, this would do:
awk '$2 > 45895' test-file_long.txt
Variables change type depending on the context in which they are evaluated, so putting a variable in an explicitly numeric context makes awk treat it as a number. @glenn's suggestion of i+0 demonstrates this perfectly.
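As a minimal sketch of that context switch (the variable s and the wrapper function compare_demo are illustrative names, not something from the question), the same value can be compared once as a string and once as a number:

```shell
# compare_demo is a hypothetical wrapper, used only to keep the example self-contained.
compare_demo() {
    awk 'BEGIN {
        s = "224919"             # a string, like a key read back from an array
        print (s > 45895)        # string comparison: "2" sorts before "4", prints 0
        print (s + 0 > 45895)    # numeric comparison: 224919 > 45895, prints 1
    }'
}
compare_demo
```

Adding 0 does not change the value; it only forces awk to evaluate s as a number before the comparison.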
Alternatively, the unary plus operator, +i, can be used to convert an expression to a number. So your longer example could be changed to:
awk '{a[$2]} END { for (i in a) { if (+i > 45895) print i } }' test-file_long.txt
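As for the floats part of the question, a numeric context should work there too, since it does not truncate the way int() does. Here is a rough sketch, assuming made-up input rows with decimal values in the 2nd column and a hypothetical wrapper named float_demo:

```shell
# float_demo and its three input rows are invented for illustration only.
float_demo() {
    printf '%s\n' '2 1.5 A G' '2 10.25 A G' '2 9.75 T C' |
    awk '{ a[$2] } END { for (i in a) if (i + 0 > 9.5) print i }' |
    LC_ALL=C sort -n    # for-in order is unspecified, so sort for stable output
}
float_demo
```

With a plain i > 9.5, the key 10.25 would be compared as a string and rejected, because "1" sorts before "9". With i + 0 it is compared as the number 10.25 and kept.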