Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Picking up CSV fields by name using awk

Tags:

csv

awk

gawk

Suppose I have a CSV file with headers of the following form:

Field1,Field2
3,262000
4,449000
5,650000
6,853000
7,1061000
8,1263000
9,1473000
10,1683000
11,1893000

I would like to write an awk script which will take a comma-separated list of field names target, split it into an array, and then only pick out those columns with the names I specify.

This is what I have tried so far, and I have verified that the head array contains the desired headers, and the targets array contains the desired targets passed in by the given command line.

BEGIN{
    FS=","
    split(target, targets, ",")

}

NR==1 {
    for (i = 1; i <= NF; i++) head[i] = $i
}

NR !=1{
    for (i = 1; i <= NF; i++) {
        if (head[i] in targets){
            print $i
        }
    }
}

When I invoke this script with the command

awk -v target=Field1 -f GetCol.awk Debug.csv

I get nothing printed out.

like image 889
merlin2011 Avatar asked Apr 18 '13 20:04

merlin2011


3 Answers

I figured it out and am posting the answer in case others run into the same problem.

It has to do with the in keyword I am using for testing array membership. This keyword only tests whether the operand on the left is one of the indices in the array on the right, rather than on of the values. The fix is to create a reverse-lookup array, as follows.

BEGIN{
    OFS=FS=","
    split(target, t_targets, ",")
    for (i in t_targets)
        targets[t_targets[i]] = i
}
like image 197
merlin2011 Avatar answered Oct 19 '22 19:10

merlin2011


My two cents:

BEGIN{
    OFS=FS=","
    split(target,fields,FS)            # We just set FS don't hard the comma here
    for (i in fields)                  # Distinct var name to aviod headaches
        field_idx[fields[i]] = i       # Reverse lookup 
}
NR==1 {                                # Process header
    for (i=1;i<=NF;i++)                # For each field header
        head[i] = $i                   # Add to hash for comparision with target
    next                               # Skip to next line
}
{                                      # Don't need invert condition (used next)
    sep=""                             # Set for leading separator
    for (i=1;i<=NF;i++)                # For each field
        if (head[i] in field_idx) {    # Test for current field is a target field
            printf "%s%s",sep,$i       # Print the column if matched 
            sep=OFS                    # Set separator to OFS                  
    }
    printf "\n"                        # Print newline character
}
like image 36
Chris Seymour Avatar answered Oct 19 '22 20:10

Chris Seymour


An extension of @sudo_O's solution (thank you) that

  1. outputs fields from standard input based on the command line arguments,
  2. outputs the fields in the order requested (possibly multiple times),
  3. outputs a placeholder when a field was requested but not found, and
  4. warns on standard error about duplicate field names in the header.
#!/usr/bin/awk -f
# Process standard input outputting named columns provided as arguments.
#
# For example, given foo.dat containing
#     a b c c
#     1a 1b 1c 1C
#     2a 2b 2c 2C
#     3a 3b 3c 3C
# Running
#   cat foo.dat | ./namedcols c b a a d
# will output
#   1c 1b 1a 1a d
#   2c 2b 2a 2a d
#   3c 3b 3a 3a d
# and will warn on standard error that it
#   Ignored duplicate 'c' in column 4
# Notice that the requested but missing column d contains "d".
#
# Using awk's -F feature it is possible to parse comma-separated data:
#   cat foo.csv | ./namedcols -F, c b a a d
BEGIN {
    for (i=1; i<ARGC; ++i)
        desired[i] = ARGV[i]
    delete ARGV
}
NR==1 {
    for (i=1; i<=NF; i++)
        if ($i in names)
            printf "Ignored duplicate '%s' in column %d\n", $i, i | "cat 1>&2"
        else
            names[$i] = i
    next
}
{
    for (i=1; i<ARGC; ++i)
        printf "%s%s",                                          \
               (i==1 ? "" : OFS),                               \
               ((ndx = names[name = desired[i]])>0 ? $ndx: name)
    printf RS
}
like image 1
Rhys Ulerich Avatar answered Oct 19 '22 20:10

Rhys Ulerich