
Parse CSV with empty fields, escaped quotes and commas with awk

Tags: regex, csv, awk

I have been happily using gawk with FPAT. Here's the script I use for my examples:

#!/usr/bin/gawk -f

BEGIN {
    FPAT="([^,]*)|(\"[^\"]+\")"
}

{
    for (i=1; i<=NF; i++) {
        printf "Record #%s, field #%s: %s\n", NR, i, $i
    }
}

Simple, no quotes

Works well.

$ echo 'a,b,c,d' | ./test.awk 
Record #1, field #1: a
Record #1, field #2: b
Record #1, field #3: c
Record #1, field #4: d

With quotes

Works well.

$ echo '"a","b",c,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: c
Record #1, field #4: d

With empty columns and quotes

Works well.

$ echo '"a","b",,d' | ./test.awk 
Record #1, field #1: "a"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With escaped quotes, empty columns and quotes

Works well.

$ echo '"""a"": aaa","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

With a column containing escaped quotes and ending with a comma

Fails.

$ echo '"""a"": aaa,","b",,d' | ./test.awk 
Record #1, field #1: """a"": aaa
Record #1, field #2: ","
Record #1, field #3: b"
Record #1, field #4: 
Record #1, field #5: d

Expected output:

$ echo '"""a"": aaa,","b",,d' | ./test_that_would_be_working.awk 
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3: 
Record #1, field #4: d

Is there a regex for FPAT that would make this work, or is this just not supported by awk?

What I need is a pattern meaning: a " followed by anything except a lone " (a doubled "" must be allowed). A bracket expression matches one character at a time, so it cannot reject a single " while still accepting "".

I think there may be an option with lookaround, but I'm not good enough with it to make it work.

Benoit Duffez asked Nov 02 '17 16:11



1 Answer

Because gawk's regular expressions (and therefore FPAT) don't support lookarounds, you need to be explicit in your pattern. This one will do:

FPAT="[^,\"]*|\"([^\"]|\"\")*\""

Explanation:

[^,\"]*             # an unquoted field: 0 or more characters other than , and "
|                   # OR
\"                  # a quoted field: an opening "
  ([^\"]            #   then any character except "
   |                #   OR
   \"\"             #   an escaped quote, written as ""
  )*                #   repeated 0 or more times
\"                  # and a closing "
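
To see the difference between the two quoted alternatives in isolation, here is a minimal sketch that prints what each one matches at the start of the failing record. matched_prefix is a hypothetical helper name, and it relies only on POSIX match(), so it does not need FPAT:

```shell
# Print the prefix of a record matched by an anchored regex.
# matched_prefix is a hypothetical helper, not part of awk or gawk.
# It uses only POSIX awk features (match, RSTART, RLENGTH).
matched_prefix() {
    # $1 = record, $2 = regex passed as a dynamic regex string (anchor it with ^)
    printf '%s\n' "$1" | awk -v re="$2" '{
        if (match($0, re)) print substr($0, RSTART, RLENGTH)
        else               print "(no match)"
    }'
}

record='"""a"": aaa,","b",,d'

# Quoted alternative from the question
matched_prefix "$record" '^"[^"]+"'         # prints: (no match)

# Quoted alternative from the answer
matched_prefix "$record" '^"([^"]|"")*"'    # prints: """a"": aaa,"
```

The question's "[^"]+" finds nothing at the start of the record because the opening quote is immediately followed by another quote, so the [^,]* alternative wins and stops at the first comma. The answer's alternative consumes each "" pair as a unit and keeps going until the real closing quote.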

Now testing it:

$ cat tst.awk
BEGIN {
    FPAT="[^,\"]*|\"([^\"]|\"\")*\""
}
{
    for (i=1; i<=NF; i++) {
        printf "Record #%s, field #%s: %s\n", NR, i, $i
    }
}


$ echo '"""a"": aaa,","b",,d' | gawk -f tst.awk
Record #1, field #1: """a"": aaa,"
Record #1, field #2: "b"
Record #1, field #3:
Record #1, field #4: d
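
If you also need the field values without the CSV quoting, one possible follow-up (not part of the pattern above, just a sketch) is to strip the outer quotes and collapse each "" to " after FPAT has split the record. unquote_fields is a made-up name, and this assumes gawk, since FPAT is a gawk extension, and records that do not contain embedded newlines:

```shell
# Split CSV records with the FPAT from the answer, then undo the CSV quoting.
# unquote_fields is a hypothetical helper; it requires gawk because of FPAT.
unquote_fields() {
    gawk '
    BEGIN {
        FPAT = "[^,\"]*|\"([^\"]|\"\")*\""
    }
    {
        for (i = 1; i <= NF; i++) {
            f = $i
            if (f ~ /^".*"$/) {
                f = substr(f, 2, length(f) - 2)   # drop the outer quotes
                gsub(/""/, "\"", f)               # "" inside a field means one "
            }
            printf "Record #%s, field #%s: %s\n", NR, i, f
        }
    }'
}

printf '%s\n' '"""a"": aaa,","b",,d' | unquote_fields
```

With the sample record, field #1 comes out as "a": aaa, and field #2 as b, while the empty third field and d are left untouched.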
Marc Lambrichs answered Oct 22 '22 02:10
