
Escaping separator within double quotes, in awk

I am using awk to parse a CSV file, with "," as the field separator. However, some fields contain commas themselves, and those fields are enclosed in double quotes ("...").

Example

filed1,filed2,field3,"field4,FOO,BAR",field5 

How can I ignore the commas inside the double quotes so that awk splits the fields correctly? I know we can do this in Excel, but how do we do it in awk?

joomanji asked Oct 18 '11 08:10




1 Answer

It's easy, with GNU awk 4:

zsh-4.3.12[t]% awk '{
  for (i = 0; ++i <= NF;)
    printf "field %d => %s\n", i, $i
}' FPAT='([^,]+)|("[^"]+")' infile
field 1 => filed1
field 2 => filed2
field 3 => field3
field 4 => "field4,FOO,BAR"
field 5 => field5

Adding some comments, as the OP requested.

From the GNU awk manual, section "Defining Fields by Content":

The value of FPAT should be a string that provides a regular expression. This regular expression describes the contents of each field. In the case of CSV data as presented above, each field is either “anything that is not a comma,” or “a double quote, anything that is not a double quote, and a closing double quote.” If written as a regular expression constant, we would have /([^,]+)|("[^"]+")/. Writing this as a string requires us to escape the double quotes, leading to:

FPAT = "([^,]+)|(\"[^\"]+\")"

Because both alternatives use +, this pattern does not handle empty fields, but that can be fixed too. Again from the manual:

As written, the regexp used for FPAT requires that each field contain at least one character. A straightforward modification (changing the first ‘+’ to ‘*’) allows fields to be empty:

FPAT = "([^,]*)|(\"[^\"]+\")"

Dimitre Radoulov answered Sep 20 '22 17:09