
awk FPAT variable: Working

Tags:

regex

awk

gawk

I understand from the GNU documentation for gawk that it can handle delimiters embedded in the data using the FPAT variable, but I can't work out how this actually works. For a CSV file the FPAT value is:

FPAT = "([^,]+)|(\"[^\"]+\")"

Using the data:

abc,"pqr,mno"

The first grouped expression matches everything that is not a comma, so it should take abc as data and then fail at the first comma. My question is: what happens next? Since the first grouped expression failed, will the regexp continue from the character after the comma using the "or" condition? But the first grouped expression is still valid for the data after the comma, so might it take "pqr as the next field?

ghub24 asked Oct 15 '13 10:10


1 Answer

The FPAT value is an alternation of two field patterns, which can be read as follows.

A string not containing a comma where the string length is greater than zero (won't match empty strings):

[^,]+

Or a string starting and ending with a double quote and containing at least one character that isn't a double quote (the escaping backslashes are left out for readability):

"[^"]+"      

The regular expression engine starts matching at the beginning of the string and tries to match as much as possible with the given patterns.

abc,"pqr,mno" 

So abc is the longest string matched by either pattern from the start of the string, and hence becomes $1. The next character, the comma, cannot be matched by either pattern, so the regular expression engine moves on to the next character, ", where both patterns can start a match: the first would give "pqr (it stops at the comma), while the second gives "pqr,mno". gawk takes the longest match, so the second pattern wins; it matches to the end of the line because "pqr,mno" starts and ends with a double quote and contains at least one non-double-quote character. Therefore "pqr,mno" becomes $2 for the record abc,"pqr,mno".

Chris Seymour answered Oct 02 '22 02:10