Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest CSV Parser in Perl

Tags:

parsing

csv

perl

I am creating a subroutine that:

(1) Parses a CSV file;

(2) And checks if all the rows in that file have the expected number of columns. It croaks if the number of columns is invalid.

When the number of rows is ranging from thousands to millions, what do you think is the most efficient way to do it?

Right now, I'm trying out these implementations.

(1) Basic file parser

open my $in_fh, '<', $file or 
    croak "Cannot open '$file': $OS_ERROR";                                                            
                                                                                                      
my $row_no = 0;                                                                                           
while ( my $row = <$in_fh> ) {                                                                            
    my @values = split (q{,}, $row);                                                                      
    ++$row_no;                                                                                            
    if ( scalar @values < $min_cols_no ) {                                                                
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }                                                                                                     
}                                                                                                         
                                                                                                      
close $in_fh                                                                                              
    or croak "Cannot close '$file': $OS_ERROR";                                                           

(2) Using Text::CSV_XS (bind_columns and csv->getline)

my $csv = Text::CSV_XS->new () or                                                                         
   croak "Cannot use CSV: " . Text::CSV_XS->error_diag();                                                 
open my $in_fh, '<', $file or                                                                             
   croak "Cannot open '$file': $OS_ERROR";                                                                
                                                                                                          
 my $row_no = 1;                                                                                          
 my @cols = @{$csv->getline($in_fh)};                                                                     
 my $row = {};                                                                                            
 $csv->bind_columns (\@{$row}{@cols});                                                                    
 while ($csv->getline ($in_fh)) {                                                                         
    ++$row_no;                                                                                            
    if ( scalar keys %$row < $min_cols_no ) {                                                             
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }                                                                                                     
}                                                                                                         
                                                                                                          
$csv->eof or $csv->error_diag();                                                                          
close $in_fh or
    croak "Cannot close '$file': $OS_ERROR";                                                           

(3) Using Text::CSV_XS (csv->parse)

my $csv = Text::CSV_XS->new() or                                                                         
   croak "Cannot use CSV: " . Text::CSV_XS->error_diag();                                                
 open my $in_fh, '<', $file or                                                                           
   croak "Cannot open '$file': $OS_ERROR";                                                               
                                                                                                         
 my $row_no = 0;                                                                                         
 while ( <$in_fh> ) {                                                                                    
     $csv->parse($_);                                                                                    
     ++$row_no;                                                                                          
     if ( scalar $csv->fields < $min_cols_no ) {                                                         
       croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
     }                                                                                                   
}                                                                                                        
                                                                                                         
$csv->eof or $csv->error_diag();                                                                         
close $in_fh or 
    croak "Cannot close '$file': $OS_ERROR";                                                          

(4) Using Parse::CSV

use Parse::CSV;                                                                                           
my $simple = Parse::CSV->new(                                                                             
    file => $file                                                                                         
);                                                                                                        
                                                                                                          
my $row_no = 0;                                                                                           
while ( my $array_ref = $simple->fetch ) {                                                                
    ++$row_no;                                                                                            
    if ( scalar @$array_ref < $min_cols_no ) {                                                            
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }                                                                                                     
}                                                                                                         

I benchmark-ed them using the Benchmark module.

use Benchmark qw(timeit timestr timediff :hireswallclock);

And these are the numbers (in seconds) that I got:

1,000 lines of file:

Implementation 1: 0.0016

Implementation 2: 0.0025

Implementation 3: 0.0050

Implementation 4: 0.0097

10,000 lines of file:

Implementation 1: 0.0204

Implementation 2: 0.0244

Implementation 3: 0.0523

Implementation 4: 0.1050

1,500,000 lines of file:

Implementation 1: 1.8697

Implementation 2: 3.1913

Implementation 3: 7.8475

Implementation 4: 15.6274

Given these numbers, I would conclude that the simple parser is the fastest but from what I have read from different sources, Text::CSV_XS should be the fastest.

Will someone enlighten me on this? Is there something wrong with how I used the modules? Thanks a lot for your help!

like image 583
Carlisle18 Avatar asked Dec 17 '12 15:12

Carlisle18


People also ask

How do I read a column from a CSV file in Perl?

To read a column from csv for this purpose I wrote this script: #!/usr/bin/perl -w use strict; use warnings; use Text::CSV; my$column_separator = qr/,/; my $column_number = "3"; my$file = "/home/Admin/Documents/new (copy).

What is the difference between text csv and application csv?

If a server says "This data is of type text/csv" the client can understand that can render that data internally, while if the server says "This data is of type application/csv" the client knows that it needs to launch the application that is registered on the OS to open csv files.

What is a csv parser?

The Comma Separated Values (CSV) Parser reads and writes data in a CSV format. Note: In the Config Editor, the parameters are set in the Parser tab of the Connector.


3 Answers

There are CSV files

header1,header2,header3
value1,value2,value3

and then there are CSV files.

header1,"This, as they say, is header2","And header3
even contains a newline!"
value1,"value2, 2nd in a series of 3 values",value3

Text::CSV and its ilk have been painstakingly developed and tested to deal with the second kind. If you are confident that your input does and always will conform to the simple CSV specification, then it is very likely that you can build a parser that will outperform Text::CSV.

like image 166
mob Avatar answered Oct 11 '22 23:10

mob


Note that your Text::CSV_XS version does more than your simple parser version. It splits the line, puts it into memory, and makes your hashref point to the fields.

It also may have other logic under the hood, like allowing escaped delimiters (I don't know, as I haven't used it). On top of that, there is always a small amount of overhead when using a module: function calls, passing parameters back and forth, and perhaps generic code that doesn't really apply in your case (such as error checking for things you don't care about).

Normally the benefits of using a module greatly outweigh the costs. You get more features, more reliable code, etc. But that might not be true with a small, very simple task. If all you need to do is verify the number of columns, using a module might be overkill. You could make your own implementation even faster by just counting the number of columns, and not bothering to split at all:

/(?:,[^,]*){$min_cols_no-1}/ or croak "Did not find minimum number of columns";

If you are going to do real processing in addition to this verification step, using the module will probably be beneficial.

like image 42
dan1111 Avatar answered Oct 11 '22 22:10

dan1111


All CSV parsing modules do the same thing: opening the file and parse the CSV in some way, much like you did in your basic sub. They just carry a lot more overhead because internally, they do a lot more than you need (check for proper CSV format, pass around object structures etc). That makes them slower than your basic approach, to varying extent.

You benchmarked the approaches yourself; isn't the result obvious? If I didn't need the extended functionality of the CSV modules, I would parse a CSV file the basic way myself.

(I don't know if you could speed them up by improving your usage of the modules)

like image 24
mpe Avatar answered Oct 11 '22 21:10

mpe