What's the most robust way to efficiently parse CSV using awk?

Tags:

csv

awk

The intent of this question is to provide a canonical answer.

Given a CSV as might be generated by Excel or other tools with embedded newlines and/or double quotes and/or commas in fields, and empty fields like:

$ cat file.csv
"rec1, fld1",,"rec1"",""fld3.1
"",
fld3.2","rec1
fld4"
"rec2, fld1.1

fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3,fld2""",

What's the most robust way to efficiently use awk to identify the separate records and fields:

Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----

so it can be used as those records and fields internally by the rest of the awk script.

A valid CSV would be one that conforms to RFC 4180 or can be generated by MS-Excel.

The solution must tolerate the end of record being just LF (\n), as is typical for UNIX files, rather than the CRLF (\r\n) that the standard requires and that Excel or other Windows tools would generate. It must also tolerate unquoted fields mixed with quoted fields. It specifically does not need to tolerate escaping "s with a preceding backslash (i.e. \" instead of "") as some other CSV formats allow. If you have that, adding a gsub(/\\"/,"\"\"") up front would handle it; trying to handle both escaping mechanisms automatically in one script would make the script unnecessarily fragile and complicated.
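As a minimal sketch of that up-front conversion (the function name normalize_quotes and the sample line are made up for illustration), the gsub() can be run as its own pass before any CSV parsing:

```shell
# Hypothetical pre-pass: turn backslash-escaped quotes (\") into the
# doubled quotes ("") that RFC 4180 style parsers expect.
normalize_quotes() {
    awk '{ gsub(/\\"/, "\"\""); print }'
}

# Illustrative input only, not taken from file.csv.
printf '%s\n' '"a \"quoted\" word",b' | normalize_quotes
# prints: "a ""quoted"" word",b
```

Note this is a blind textual replacement: a quoted field that legitimately ends in a backslash, such as "C:\", would be altered too, which is one reason not to try to support both escaping styles at once.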

936

asked Jul 31 '17 16:07

Ed Morton

2 Answers

If your CSV cannot contain newlines then all you need is (with GNU awk for FPAT):

$ echo 'foo,"field,""with"",commas",bar' |
    awk -v FPAT='[^,]*|("([^"]|"")*")' '{for (i=1; i<=NF;i++) print i, "<" $i ">"}'
1 <foo>
2 <"field,""with"",commas">
3 <bar>

If all you actually want to do is convert your CSV to individual lines by, say, replacing newlines with blanks and commas with semi-colons inside quoted fields then all you need is this, again using GNU awk for multi-char RS and RT:

$ awk -v RS='"([^"]|"")*"' -v ORS= '{gsub(/\n/," ",RT); gsub(/,/,";",RT); print $0 RT}' file.csv
"rec1; fld1",,"rec1"";""fld3.1 ""; fld3.2","rec1 fld4"
"rec2; fld1.1  fld1.2","rec2 fld2.1""fld2.2""fld2.3","",rec2 fld4
"""""","""rec3;fld2""",

Otherwise, though, the general, robust, portable solution to identify the fields that will work with any modern awk* is:

$ cat decsv.awk
function buildRec(      fpat,fldNr,fldStr,done) {
    CurrRec = CurrRec $0
    if ( gsub(/"/,"&",CurrRec) % 2 ) {
        # The string built so far in CurrRec has an odd number
        # of "s and so is not yet a complete record.
        CurrRec = CurrRec RS
        done = 0
    }
    else {
        # If CurrRec ended with a null field we would exit the
        # loop below before handling it so ensure that cannot happen.
        # We use a regexp comparison using a bracket expression here
        # and in fpat so it will work even if FS is a regexp metachar
        # or a multi-char string like "\\\\" for \-separated fields.
        CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
        $0 = ""
        fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
        while ( (CurrRec != "") && match(CurrRec,fpat) ) {
            fldStr = substr(CurrRec,RSTART,RLENGTH)
            # Convert <"foo"> to <foo> and <"foo""bar"> to <foo"bar>
            if ( gsub(/^"|"$/,"",fldStr) ) {
                gsub(/""/, "\"", fldStr)
            }
            $(++fldNr) = fldStr
            CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
        }
        CurrRec = ""
        done = 1
    }
    return done
}

# If your input has \-separated fields, use FS="\\\\"; OFS="\\"
BEGIN { FS=OFS="," }
!buildRec() { next }
{
    printf "Record %d:\n", ++recNr
    for (i=1;i<=NF;i++) {
        # To replace newlines with blanks add gsub(/\n/," ",$i) here
        printf "    $%d=<%s>\n", i, $i
    }
    print "----"
}

$ awk -f decsv.awk file.csv
Record 1:
    $1=<rec1, fld1>
    $2=<>
    $3=<rec1","fld3.1
",
fld3.2>
    $4=<rec1
fld4>
----
Record 2:
    $1=<rec2, fld1.1

fld1.2>
    $2=<rec2 fld2.1"fld2.2"fld2.3>
    $3=<>
    $4=<rec2 fld4>
----
Record 3:
    $1=<"">
    $2=<"rec3,fld2">
    $3=<>
----
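Since buildRec() repopulates $0, NF and the fields, the main body after `!buildRec() { next }` can be any ordinary awk code. Here is a rough sketch of that idea, assuming a made-up wrapper function show_second_field and made-up sample input; buildRec() itself is copied from decsv.awk above with its comments trimmed:

```shell
# Print "NF: $2" for every logical CSV record read from stdin.
show_second_field() {
    awk '
    # buildRec() copied from decsv.awk above (comments trimmed).
    function buildRec(      fpat,fldNr,fldStr,done) {
        CurrRec = CurrRec $0
        if ( gsub(/"/,"&",CurrRec) % 2 ) {
            CurrRec = CurrRec RS      # odd number of quotes: record continues
            done = 0
        }
        else {
            CurrRec = CurrRec ( CurrRec ~ ("[" FS "]$") ? "\"\"" : "" )
            $0 = ""
            fpat = "([^" FS "]*)|(\"([^\"]|\"\")+\")"
            while ( (CurrRec != "") && match(CurrRec,fpat) ) {
                fldStr = substr(CurrRec,RSTART,RLENGTH)
                if ( gsub(/^"|"$/,"",fldStr) ) {
                    gsub(/""/, "\"", fldStr)
                }
                $(++fldNr) = fldStr
                CurrRec = substr(CurrRec,RSTART+RLENGTH+1)
            }
            CurrRec = ""
            done = 1
        }
        return done
    }
    BEGIN { FS=OFS="," }
    !buildRec() { next }
    {
        # Illustrative main body: flatten newlines in field 2, then report it.
        gsub(/\n/, " ", $2)
        print NF ": " $2
    }
    '
}

# Hypothetical sample: the first record spans two physical lines.
printf '%s\n' 'id,"note' 'line2",x' '2,"a ""b"" c",' '3,plain,z' | show_second_field
# prints:
# 3: note line2
# 3: a "b" c
# 3: plain
```

The second record ends in a comma, so it also exercises the trailing null field handling: NF is still 3.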

The above assumes UNIX line endings of \n. With Windows \r\n line endings it's much simpler, as the "newlines" within each field will actually just be line feeds (i.e. \ns), so you can set RS="\r\n" (using GNU awk for multi-char RS) and then the \ns within fields will not be treated as line endings.

The script works by counting how many "s are present so far in the current record whenever it encounters the RS. If the count is odd then the RS (presumably \n, but it doesn't have to be) is mid-field, so we keep building the current record; if it's even then it's the end of the current record, so we can continue with the rest of the script, processing the now complete record.
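To see the quote-counting idea in isolation, here is a stripped-down sketch that only counts logical records and does no field splitting. The function name count_records and the variables buf, pending and n are invented for this illustration and are not part of decsv.awk:

```shell
# Count logical CSV records on stdin by joining physical lines until
# the running number of double quotes is even.
count_records() {
    awk '
    {
        # Append this line to the buffer, restoring the newline that
        # awk stripped if we are still inside a quoted field.
        buf = (pending ? buf "\n" : "") $0
        # gsub() returns the number of substitutions, so replacing each
        # quote with itself ("&") counts the quotes without changing buf.
        pending = gsub(/"/, "&", buf) % 2
        if (!pending) n++
    }
    END { print n + 0 }
    '
}

# Three physical lines, but the first two form one record.
printf '"a\nb",c\nd,e\n' | count_records
# prints: 2
```

This works because escaped quotes are doubled, so they never change whether the count is odd or even.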

*I say "modern awk" above because there are apparently extremely old (i.e. circa 2000) versions of tawk and mawk1 still around which have bugs in their gsub() implementation such that gsub(/^"|"$/,"",fldStr) would not remove the start/end "s from fldStr. If you're using one of those then get a new awk, preferably gawk, as there could be other issues with them too. If that's not an option, I expect you can work around that particular bug by changing this:

        if ( gsub(/^"|"$/,"",fldStr) ) {

to this:

        if ( sub(/^"/,"",fldStr) && sub(/"$/,"",fldStr) ) {
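A quick way to check how that two-sub() variant behaves in whatever awk you have is to run it on single fields. This is only a sketch: the wrapper function unquote and the variable fld are made up for the test and do not appear in decsv.awk:

```shell
# Dequote one CSV field per input line using the two-sub() form.
unquote() {
    awk '{
        fld = $0
        # Strip the enclosing quotes first; only then collapse "" to ".
        if ( sub(/^"/, "", fld) && sub(/"$/, "", fld) ) {
            gsub(/""/, "\"", fld)
        }
        print fld
    }'
}

printf '%s\n' '"foo""bar"' 'plain' | unquote
# prints:
# foo"bar
# plain
```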

Thanks to the following people for identifying and suggesting solutions to the stated issues with the original version of this answer:

@mosvy for escaped double quotes within fields.
@datatraveller1 for multiple contiguous pairs of escaped quotes in a field and null fields at the end of records.

Related: also see "How do I use awk under cygwin to print fields from an excel spreadsheet?" for how to generate CSVs from Excel spreadsheets.

139

answered Oct 08 '22 10:10

Ed Morton

An improvement upon @EdMorton's FPAT solution, which should be able to handle double-quotes(") escaped by doubling ("" -- as allowed by the CSV standard).

gawk -v FPAT='[^,]*|("[^"]*")+' ...

This STILL:

isn't able to handle newlines inside quoted fields, which are perfectly legit in standard CSV files.
assumes GNU awk (gawk); a standard awk won't do.

Example:

$ echo 'a,,"","y""ck","""x,y,z"," ",12' | gawk -v OFS='|' -v FPAT='[^,]*|("[^"]*")+' '{$1=$1}1'
a||""|"y""ck"|"""x,y,z"|" "|12

$ echo 'a,,"","y""ck","""x,y,z"," ",12' | gawk -v FPAT='[^,]*|("[^"]*")+' '{
  for(i=1; i<=NF;i++){
    if($i~/"/){ $i = substr($i, 2, length($i)-2); gsub(/""/,"\"", $i) }
    print "<"$i">"
  }
}'
<a>
<>
<>
<y"ck>
<"x,y,z>
< >
<12>

answered Oct 08 '22 11:10

mosvy
