I am using awk to parse my data with "," as the field separator, since the input is a CSV file. However, some fields contain "," characters themselves, and those fields are enclosed in double quotes ("...").
Example
filed1,filed2,field3,"field4,FOO,BAR",field5
How can I ignore the commas inside the double quotes so that awk splits the fields correctly? I know we can do this in Excel, but how do we do it in awk?
One use of an escape sequence is to include a double-quote character in a string constant. Because a plain double quote ends the string, you must use ' \" ' to represent an actual double-quote character as a part of the string. For example: $ awk 'BEGIN { print "He said \"hi!\
The FPAT variable offers a solution for cases like this. The value of FPAT should be a string that provides a regular expression. This regular expression describes the contents of each field.
Just pass your desired field separator to awk with the -F option, then print the column you want by its number; columns are split according to that separator.
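A minimal sketch of that -F pattern, assuming plain comma-separated input with no quoted commas (the sample line is made up for illustration):

```shell
# Print the second comma-separated column of each input line.
# The sample line is illustrative only.
printf '%s\n' 'alpha,beta,gamma' | awk -F',' '{ print $2 }'
```

This prints beta. On the line from the question, -F',' would also split inside the quoted field, so $4 would come out as "field4 rather than the whole quoted value, which is exactly the problem being asked about.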
You can import double quotation marks using CSV files and import maps by escaping the double quotation marks. To escape a double quotation mark, precede it with another double quotation mark.
It's easy, with GNU awk 4:
zsh-4.3.12[t]% awk '{ for (i = 0; ++i <= NF;) printf "field %d => %s\n", i, $i }' FPAT='([^,]+)|("[^"]+")' infile
field 1 => filed1
field 2 => filed2
field 3 => field3
field 4 => "field4,FOO,BAR"
field 5 => field5
Adding some comments as per OP requirement.
From the GNU awk manual, section "Defining Fields by Content":
The value of FPAT should be a string that provides a regular expression. This regular expression describes the contents of each field. In the case of CSV data as presented above, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant, we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:
FPAT = "([^,]+)|(\"[^\"]+\")"
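To show where that string form fits, here is a rough sketch that sets FPAT in a BEGIN block and strips the surrounding quotes before printing. The function name csv_fields is hypothetical, and the sketch assumes GNU awk 4.0 or later is installed as gawk, since FPAT is a gawk extension.

```shell
# Sketch only: csv_fields is a hypothetical helper name, not part of awk.
# Requires GNU awk 4.0 or later, because FPAT is a gawk extension.
csv_fields() {
    gawk '
    BEGIN {
        # Each field is either a run of non-commas or a double-quoted string.
        FPAT = "([^,]+)|(\"[^\"]+\")"
    }
    {
        for (i = 1; i <= NF; i++) {
            field = $i
            # Strip the surrounding double quotes, if any, for display.
            if (field ~ /^".*"$/)
                field = substr(field, 2, length(field) - 2)
            printf "field %d => %s\n", i, field
        }
    }' "$@"
}
```

With the sample line from the question saved in infile, csv_fields infile should print the same five lines as the one-liner above, except that field 4 appears as field4,FOO,BAR without the quotes.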
Because + is used twice, this does not work properly for empty fields, but that can be fixed as well:
As written, the regexp used for FPAT requires that each field contain at least one character. A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty:
FPAT = "([^,]*)|(\"[^\"]+\")"
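A quick way to see the difference is to count fields on a line with an empty field. This is only a sketch: count_fields is a hypothetical helper name, and it assumes GNU awk is installed as gawk.

```shell
# Sketch only: count_fields is a hypothetical helper, not part of awk.
count_fields() {
    # $1 is the FPAT regular expression; CSV lines are read from standard input.
    gawk -v FPAT="$1" '{ print NF }'
}
```

For example, printf 'a,,c\n' | count_fields '([^,]+)|("[^"]+")' prints 2 because the empty field is skipped, while printf 'a,,c\n' | count_fields '([^,]*)|("[^"]+")' prints 3. Passing the pattern with -v avoids the backslash-escaped quotes needed in the string form.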