awk: canonical way to reference a field by name (first line as header)

Question

I often run into this situation. Initially, my data looks like this:

$ cat data.csv
id name age
0 Mark 20
1 Robert 35
2 John 15

Given that I want to access the age field, I write my awk script like this:

$ awk '{ print $3; /* print age */ }' data.csv
age
20
35
15

Later, the column order changes. For example, surname is inserted between name and age, so age shifts to the 4th position:

$ cat data.csv
id name surname age
0 Mark Ross 20
1 Robert Green 35
2 John Doe 15

If I run my script again, I obviously get the wrong output:

$ awk '{ print $3; /* print age */ }' data.csv
surname
Ross
Green
Doe

To avoid this issue, I usually reference the fields by name using this bit of code:

$ awk 'NR == 1 { split($0, keys) }; { for (i=1; i<=length(keys); i++) { f[keys[i]]=$i }; print f["age"]; /* print age */ }' data.csv
age
20
35
15

This works fine and lets me reference any field by name, such as f["id"], f["name"], etc., as desired.

What I don't like here is that the script looks hackish and verbose. I have the feeling that I'm missing a native function or a more straightforward way to achieve something similar.

Does awk offer a canonical way to reference fields by name, using the first line as a header?

jhnc · Accepted Answer

I don't know what is canonical, but you can avoid a lot of unnecessary copying.

From your code:

# why split again? we already have $1..$NF
NR == 1 { split($0, keys) }
{
    # why do this on every line of input?
    for (i=1; i<=length(keys); i++){ f[keys[i]]=$i }

    print f["age"]; /* print age */
}

to:

NR==1 { while (i++ < NF) f[$i]=i }

{ print $f["age"] }
#       ^

Note: I have tried to make the first-line pattern/action as concise as possible to address your verbosity concern. The more straightforward for(i=1;i<=NF;i++) is nearly as short, is more maintainable than this while loop, and avoids the potential gotcha that arises if i has already been used elsewhere in the script.
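For reference, here is a minimal sketch of the for-loop variant mentioned in the note, written out as a complete command. The file name sample.csv and the wrapper function print_age are hypothetical names used only to make the example self-contained; the data mirrors the four-column file from the question.

```shell
# Hypothetical sample file with the same layout as the question's second data.csv.
printf '%s\n' 'id name surname age' \
              '0 Mark Ross 20' \
              '1 Robert Green 35' \
              '2 John Doe 15' > sample.csv

# print_age FILE: hypothetical wrapper, only here so the command is easy to reuse.
print_age() {
    awk '
    NR == 1 {
        # Map each header name to its column number, e.g. f["age"] = 4.
        # Starting the loop at i = 1 means it does not rely on i being unset.
        for (i = 1; i <= NF; i++) f[$i] = i
    }
    # f["age"] holds a column number, so $f["age"] is the field at that position.
    { print $f["age"] }
    ' "$1"
}

print_age sample.csv
```

As in the original script, the header line itself is printed too, because the second rule also runs for line 1. Adding next at the end of the NR == 1 action would skip it.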

markp-fuso · Answer

Setup:

$ head data.?.csv
==> data.1.csv <==
id name age
0 Mark 20
1 Robert 35
2 John 15

==> data.2.csv <==
id name surname age
0 Mark Ross 20
1 Robert Green 35
2 John Doe 15

I'd index the keys[] array by column name instead of by position, so that each header name maps to its column number, e.g.:

colname='age'

for fname in data.1.csv data.2.csv
do
    echo "########## file: ${fname}"

    awk -v cname="${colname}" '
    FNR==1 { for (i=1;i<=NF;i++)
                 keys[$i]=i
           }
           { print $(keys[cname]) }
    ' "${fname}"
done

This generates:

########## file: data.1.csv
age
20
35
15
########## file: data.2.csv
age
20
35
15
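One thing to watch with this approach: if the requested name is not in the header, keys[cname] is empty, which awk treats as 0, so $(keys[cname]) silently becomes $0 and the whole line is printed. Below is a rough sketch of a guard against that, assuming a single input file. The helper name print_col and the file sample.csv are hypothetical, and writing to "/dev/stderr" assumes an awk or OS that provides it (gawk, mawk and typical Linux/BSD systems do).

```shell
# Hypothetical sample file with the same layout as data.2.csv above.
printf '%s\n' 'id name surname age' \
              '0 Mark Ross 20' \
              '1 Robert Green 35' \
              '2 John Doe 15' > sample.csv

# print_col FILE NAME: print the column called NAME, or fail if it is not in the header.
print_col() {
    awk -v cname="$2" '
    FNR == 1 {
        for (i = 1; i <= NF; i++) keys[$i] = i   # header name -> column number
        if (!(cname in keys)) {
            # Without this check, $(keys[cname]) would evaluate to $0.
            print "no column named " cname > "/dev/stderr"
            exit 1
        }
    }
    { print $(keys[cname]) }
    ' "$1"
}

print_col sample.csv age
print_col sample.csv height || echo "lookup failed"
```

The first call prints the age column including its header. The second prints the error message on stderr and then "lookup failed", because awk exits with status 1 before printing any field.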
