Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using awk or perl to extract specific columns from CSV (parsing)

Background - I want to extract specific columns from a csv file. The csv file is comma delimited, uses double quotes as the text-qualifier (optional, but when a field contains special characters, the qualifier will be there - see example), and uses backslashes as the escape character. It is also possible for some fields to be blank.


Example Input and Desired Output - For example, I only want columns 1, 3, and 4 to be in the output file. The final extract of the columns from the csv file should match the format of the original file. No escape characters should be removed or extra quotes added and such.

Input

"John \"Super\" Doe",25,"123 ABC Street",123-456-7890,"M",A
"Jane, Mary","",132 CBS Street,333-111-5332,"F",B
"Smith \"Jr.\", Jane",35,,555-876-1233,"F",
"Lee, Jack",22,123 Sesame St,"","M",D

Desired Output

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""

Preliminary Script (awk) - The following is a preliminary script I found that works for the most part, but does not work in one particular instance that I noticed and possibly more that I have not seen or thought of yet

#!/usr/xpg4/bin/awk -f

BEGIN{  OFS = FS = ","  }

/"/{
    for(i=1;i<=NF;i++){
        if($i ~ /^"[^"]+$/){
            for(x=i+1;x<=NF;x++){
                $i=$i","$x
                if($i ~ /"+$/){
                    z = x - (i + 1) + 1
                    for(y=i+1;y<=NF;y++)
                        $y = $(y + z)
                    break
                }
            }
            NF = NF - z
            i=x
        }
    }
print $1,$3,$4
}

The above seems to work well until it comes across a field that contains both escaped double quotes as well as a comma. In that case, the parsing will be off and the output will be incorrect.


Question/Comments - I have read that awk is not the best option for parsing through csv files, and perl is suggested. However, I do not know perl at all. I have found some examples of perl scripts, but they do not give the desired output I am looking for and I do not know how to edit the scripts easily for what I want.

As for awk, I am familiar with it and use the basic functionality of it occasionally, but I do not know a lot of the advanced functionality like some of the commands used in the script above. Is my desired output possible just by using awk? If so, would it be possible edit the script above to fix the issue I am having with it? Could someone explain line by line what exactly the script is doing?

Any help would be appreciated, thanks!

like image 448
yousir Avatar asked Feb 15 '12 04:02

yousir


1 Answers

I'm not going to reinvent the wheel.

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({
   binary      => 1,
   escape_char => '\\',
   eol         => "\n",
});

my $fh_in  = \*STDIN;
my $fh_out = \*STDOUT;

while (my $row = $csv->getline($fh_in)) {
   $csv->print($fh_out, [ @{$row}[0,2,3] ])
      or die("".$csv->error_diag());
}

$csv->eof()
   or die("".$csv->error_diag());

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary","132 CBS Street",333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack","123 Sesame St",

It adds quotes around addresses that didn't have any already, but since some addresses already have quotes around them, you obviously can handle that.


Reinventing the wheel:

my $field = qr/"(?:[^"\\]|\\.)*"|[^"\\,]*/s;
while (<>) {
   my @fields = /^($field),$field,($field),($field),/
      or die;
   print(join(',', @fields), "\n");
}

Output:

"John \"Super\" Doe","123 ABC Street",123-456-7890
"Jane, Mary",132 CBS Street,333-111-5332
"Smith \"Jr.\", Jane",,555-876-1233
"Lee, Jack",123 Sesame St,""
like image 184
ikegami Avatar answered Oct 05 '22 20:10

ikegami