Suppose I have a CSV file with headers of the following form:
Field1,Field2
3,262000
4,449000
5,650000
6,853000
7,1061000
8,1263000
9,1473000
10,1683000
11,1893000
I would like to write an awk script which will take a comma-separated list of field names target
, split it into an array, and then only pick out those columns with the names I specify.
This is what I have tried so far, and I have verified that the head
array contains the desired headers, and the targets
array contains the desired targets passed in by the given command line.
BEGIN{
FS=","
split(target, targets, ",")
}
NR==1 {
for (i = 1; i <= NF; i++) head[i] = $i
}
NR !=1{
for (i = 1; i <= NF; i++) {
if (head[i] in targets){
print $i
}
}
}
When I invoke this script with the command
awk -v target=Field1 -f GetCol.awk Debug.csv
I get nothing printed out.
I figured it out and am posting the answer in case others run into the same problem.
It has to do with the in
keyword I am using for testing array membership.
This keyword only tests whether the operand on the left is one of the indices in the array on the right, rather than on of the values.
The fix is to create a reverse-lookup array, as follows.
BEGIN{
OFS=FS=","
split(target, t_targets, ",")
for (i in t_targets)
targets[t_targets[i]] = i
}
My two cents:
BEGIN{
OFS=FS=","
split(target,fields,FS) # We just set FS don't hard the comma here
for (i in fields) # Distinct var name to aviod headaches
field_idx[fields[i]] = i # Reverse lookup
}
NR==1 { # Process header
for (i=1;i<=NF;i++) # For each field header
head[i] = $i # Add to hash for comparision with target
next # Skip to next line
}
{ # Don't need invert condition (used next)
sep="" # Set for leading separator
for (i=1;i<=NF;i++) # For each field
if (head[i] in field_idx) { # Test for current field is a target field
printf "%s%s",sep,$i # Print the column if matched
sep=OFS # Set separator to OFS
}
printf "\n" # Print newline character
}
An extension of @sudo_O's solution (thank you) that
#!/usr/bin/awk -f
# Process standard input outputting named columns provided as arguments.
#
# For example, given foo.dat containing
# a b c c
# 1a 1b 1c 1C
# 2a 2b 2c 2C
# 3a 3b 3c 3C
# Running
# cat foo.dat | ./namedcols c b a a d
# will output
# 1c 1b 1a 1a d
# 2c 2b 2a 2a d
# 3c 3b 3a 3a d
# and will warn on standard error that it
# Ignored duplicate 'c' in column 4
# Notice that the requested but missing column d contains "d".
#
# Using awk's -F feature it is possible to parse comma-separated data:
# cat foo.csv | ./namedcols -F, c b a a d
BEGIN {
for (i=1; i<ARGC; ++i)
desired[i] = ARGV[i]
delete ARGV
}
NR==1 {
for (i=1; i<=NF; i++)
if ($i in names)
printf "Ignored duplicate '%s' in column %d\n", $i, i | "cat 1>&2"
else
names[$i] = i
next
}
{
for (i=1; i<ARGC; ++i)
printf "%s%s", \
(i==1 ? "" : OFS), \
((ndx = names[name = desired[i]])>0 ? $ndx: name)
printf RS
}
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With