I am writing a subroutine that:
(1) Parses a CSV file;
(2) Checks that every row in the file has the expected number of columns, and croaks if it does not.
The number of rows ranges from thousands to millions, so what do you think is the most efficient way to do this?
Right now, I'm trying out these implementations.
(1) Basic file parser
open my $in_fh, '<', $file
    or croak "Cannot open '$file': $OS_ERROR";
my $row_no = 0;
while ( my $row = <$in_fh> ) {
    my @values = split q{,}, $row;
    ++$row_no;
    if ( scalar @values < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
close $in_fh
    or croak "Cannot close '$file': $OS_ERROR";
(2) Using Text::CSV_XS (bind_columns and csv->getline)
my $csv = Text::CSV_XS->new()
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $in_fh, '<', $file
    or croak "Cannot open '$file': $OS_ERROR";
my $row_no = 1;
my @cols   = @{ $csv->getline($in_fh) };
my $row    = {};
$csv->bind_columns( \@{$row}{@cols} );
while ( $csv->getline($in_fh) ) {
    ++$row_no;
    if ( scalar keys %{$row} < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
$csv->eof or $csv->error_diag();
close $in_fh
    or croak "Cannot close '$file': $OS_ERROR";
(3) Using Text::CSV_XS (csv->parse)
my $csv = Text::CSV_XS->new()
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $in_fh, '<', $file
    or croak "Cannot open '$file': $OS_ERROR";
my $row_no = 0;
while (<$in_fh>) {
    $csv->parse($_);
    ++$row_no;
    if ( scalar $csv->fields < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
$csv->eof or $csv->error_diag();
close $in_fh
    or croak "Cannot close '$file': $OS_ERROR";
(4) Using Parse::CSV
use Parse::CSV;
my $simple = Parse::CSV->new(
    file => $file,
);
my $row_no = 0;
while ( my $array_ref = $simple->fetch ) {
    ++$row_no;
    if ( scalar @{$array_ref} < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
I benchmarked them using the Benchmark module.
use Benchmark qw(timeit timestr timediff :hireswallclock);
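Roughly, each implementation was wrapped in a sub and timed with something like the following (simplified; check_columns_basic is just a placeholder name for one of those subs):
# Placeholder harness: check_columns_basic() stands in for one implementation.
my $t = timeit( 1, sub { check_columns_basic( $file, $min_cols_no ) } );
print 'Implementation 1: ', timestr($t), "\n";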
And these are the numbers (in seconds) that I got:
File with 1,000 lines:
Implementation 1: 0.0016
Implementation 2: 0.0025
Implementation 3: 0.0050
Implementation 4: 0.0097
File with 10,000 lines:
Implementation 1: 0.0204
Implementation 2: 0.0244
Implementation 3: 0.0523
Implementation 4: 0.1050
File with 1,500,000 lines:
Implementation 1: 1.8697
Implementation 2: 3.1913
Implementation 3: 7.8475
Implementation 4: 15.6274
Given these numbers, I would conclude that the simple parser is the fastest, but from what I have read in various sources, Text::CSV_XS should be the fastest.
Can someone enlighten me on this? Is there something wrong with how I am using the modules? Thanks a lot for your help!
There are CSV files
header1,header2,header3
value1,value2,value3
and then there are CSV files.
header1,"This, as they say, is header2","And header3
even contains a newline!"
value1,"value2, 2nd in a series of 3 values",value3
Text::CSV and its ilk have been painstakingly developed and tested to deal with the second kind. If you are confident that your input does and always will conform to the simple CSV specification, then it is very likely that you can build a parser that will outperform Text::CSV.
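As a rough, untested sketch of what that robustness looks like for your task: Text::CSV_XS's getline() returns one logical record at a time, so a quoted field containing commas or an embedded newline still comes back as a single row, which no line-by-line split can manage ($file and $min_cols_no as in your question; binary => 1 allows embedded newlines in quoted fields).
use Text::CSV_XS;
use Carp qw(croak);
use English qw(-no_match_vars);

my $csv = Text::CSV_XS->new( { binary => 1 } )
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $in_fh, '<', $file
    or croak "Cannot open '$file': $OS_ERROR";
my $row_no = 0;
while ( my $fields = $csv->getline($in_fh) ) {
    ++$row_no;    # counts logical records, which may span several physical lines
    if ( scalar @{$fields} < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in record '$row_no'.";
    }
}
$csv->eof or $csv->error_diag();
close $in_fh
    or croak "Cannot close '$file': $OS_ERROR";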
Note that your Text::CSV_XS version does more than your simple parser version. It splits the line, puts it into memory, and makes your hashref point to the fields.
It also may have other logic under the hood, like allowing escaped delimiters (I don't know, as I haven't used it). On top of that, there is always a small amount of overhead when using a module: function calls, passing parameters back and forth, and perhaps generic code that doesn't really apply in your case (such as error checking for things you don't care about).
Normally the benefits of using a module greatly outweigh the costs. You get more features, more reliable code, etc. But that might not be true with a small, very simple task. If all you need to do is verify the number of columns, using a module might be overkill. You could make your own implementation even faster by just counting the number of columns, and not bothering to split at all:
my $wanted_commas = $min_cols_no - 1;    # a row with N columns has at least N-1 commas
/(?:,[^,]*){$wanted_commas}/ or croak "Did not find minimum number of columns";
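Dropped into your first loop, that check would look roughly like this (illustrative sketch, same variables as your basic version; it still assumes simple CSV with no quoted or escaped commas):
while ( my $row = <$in_fh> ) {
    ++$row_no;
    # Count delimiters instead of splitting into fields;
    # $wanted_commas is $min_cols_no - 1, as computed above.
    if ( $row !~ /(?:,[^,]*){$wanted_commas}/ ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
A plain character count such as ( $row =~ tr/,// ) >= $wanted_commas would be another way to avoid splitting altogether.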
If you are going to do real processing in addition to this verification step, using the module will probably be beneficial.
All CSV parsing modules do the same basic thing: open the file and parse the CSV in some way, much as you did in your basic sub. They just carry more overhead because, internally, they do a lot more than you need (checking for proper CSV format, passing around object structures, etc.). That makes them slower than your basic approach, to varying degrees.
You benchmarked the approaches yourself; isn't the result obvious? If I didn't need the extended functionality of the CSV modules, I would parse a CSV file the basic way myself.
(I don't know whether you could speed them up by improving how you use the modules.)