I often run into this situation. Initially, my data looks like this:
$ cat data.csv
id name age
0 Mark 20
1 Robert 35
2 John 15
Given that I want to access the age field, I write my awk script like this:
$ awk '{ print $3 }' data.csv   # print age
age
20
35
15
Then, eventually, the column order changes. For example, a surname column is inserted between name and age, so age shifts to the 4th position:
$ cat data.csv
id name surname age
0 Mark Ross 20
1 Robert Green 35
2 John Doe 15
If I run my script again, I obviously get the wrong output:
$ awk '{ print $3 }' data.csv   # print age
surname
Ross
Green
Doe
To avoid this issue, I usually reference the fields by name using this bit of code:
$ awk 'NR == 1 { split($0, keys) }; { for (i=1; i<=length(keys); i++) { f[keys[i]]=$i }; print f["age"] }' data.csv   # print age
age
20
35
15
This works fine and allows me to reference any field by name, such as f["id"], f["name"], etc., as desired.
What I don't like is that the script looks hackish and verbose. I have the feeling that I'm missing a native function or a more straightforward way to achieve something similar.
Does awk offer a canonical way to reference fields by name, using the first line as a header?
I don't know what is canonical, but you can avoid a lot of unnecessary copying.
From your code:
# why split again? we already have $1..$NF
NR == 1 { split($0, keys) }
{
    # why do this on every line of input?
    for (i=1; i<=length(keys); i++){ f[keys[i]]=$i }
    print f["age"]  # print age
}
to:
NR==1 { while (i++ < NF) f[$i]=i }
{ print $f["age"] }
#       ^
Note: I made the first-line pattern/action as concise as possible to address your verbosity concern. The more straightforward for(i=1;i<=NF;i++) is nearly as short, is more maintainable than this while loop, and avoids the potential gotcha of i having already been used elsewhere in the script.
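For reference, here is a minimal sketch of the for-loop variant mentioned in the note. The shell function print_col, the col variable, and the temporary sample file are assumptions added for illustration only; they are not part of the original answer.

```shell
# Sample data matching the question's second layout, written to a temp file.
data=$(mktemp)
printf '%s\n' 'id name surname age' \
              '0 Mark Ross 20' \
              '1 Robert Green 35' \
              '2 John Doe 15' > "$data"

# print_col NAME FILE  (hypothetical helper)
# Prints the column whose header on line 1 is NAME.
print_col() {
    awk -v col="$1" '
        # Header line: map each column name to its field number.
        NR == 1 { for (i = 1; i <= NF; i++) f[$i] = i }
        # Every line (header included): $ is applied to the stored number.
        { print $f[col] }
    ' "$2"
}

print_col age "$data"
```

Because the loop initialises i explicitly, it does not depend on i being unset, which is the gotcha the while version has. The call above prints the age column, header line included, wherever that column sits in the file.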
Setup:
$ head data.?.csv
==> data.1.csv <==
id name age
0 Mark 20
1 Robert 35
2 John 15
==> data.2.csv <==
id name surname age
0 Mark Ross 20
1 Robert Green 35
2 John Doe 15
I'd index the keys[] array by column name instead (the header names become the array indices and the column positions become the values), e.g.:
colname='age'
for fname in data.1.csv data.2.csv
do
    echo "########## file: ${fname}"
    awk -v cname="${colname}" '
        FNR==1 { for (i=1;i<=NF;i++)
                     keys[$i]=i
               }
               { print $(keys[cname]) }
    ' "${fname}"
done
This generates:
########## file: data.1.csv
age
20
35
15
########## file: data.2.csv
age
20
35
15
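One thing to watch with this approach: if the requested name is not in the header, keys[cname] is empty, so $(keys[cname]) evaluates to $0 and the whole line is printed. Below is a sketch of one way to guard against that, assuming a single input file. The function name print_col_checked, the error message, and the temporary sample file are made up for illustration.

```shell
# Sample data matching data.1.csv, written to a temp file.
data=$(mktemp)
printf '%s\n' 'id name age' \
              '0 Mark 20' \
              '1 Robert 35' \
              '2 John 15' > "$data"

# print_col_checked NAME FILE  (hypothetical helper)
# Like the loop above, but fails if NAME is not a header on line 1.
print_col_checked() {
    awk -v cname="$1" '
        FNR == 1 {
            for (i = 1; i <= NF; i++) keys[$i] = i
            # "in" tests membership without creating the element.
            if (!(cname in keys)) {
                print "no such column: " cname > "/dev/stderr"
                exit 1
            }
        }
        { print $(keys[cname]) }
    ' "$2"
}

print_col_checked age "$data"
```

With a name that is missing from the header, such as salary, the function prints the error to stderr and returns a non-zero status instead of echoing every line.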