I am writing a subroutine that:
(1) Parses a CSV file;
(2) Checks that every row in the file has the expected number of columns, and croaks if it does not.
The number of rows ranges from thousands to millions, so what do you think is the most efficient way to do this?
Right now, I'm trying out these implementations.
(1) Basic file parser
open my $in_fh, '<', $file
    or croak "Cannot open '$file': $OS_ERROR";
my $row_no = 0;
while ( my $row = <$in_fh> ) {
    my @values = split q{,}, $row;
    ++$row_no;
    if ( scalar @values < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
close $in_fh
    or croak "Cannot close '$file': $OS_ERROR";
(2) Using Text::CSV_XS (bind_columns and csv->getline)
my $csv = Text::CSV_XS->new()
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $in_fh, '<', $file
    or croak "Cannot open '$file': $OS_ERROR";
my $row_no = 1;
my @cols   = @{ $csv->getline($in_fh) };
my $row    = {};
$csv->bind_columns( \@{$row}{@cols} );
while ( $csv->getline($in_fh) ) {
    ++$row_no;
    if ( scalar keys %{$row} < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
$csv->eof or $csv->error_diag();
close $in_fh
    or croak "Cannot close '$file': $OS_ERROR";
(3) Using Text::CSV_XS (csv->parse)
my $csv = Text::CSV_XS->new()
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $in_fh, '<', $file
    or croak "Cannot open '$file': $OS_ERROR";
my $row_no = 0;
while (<$in_fh>) {
    $csv->parse($_);
    ++$row_no;
    if ( scalar $csv->fields < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
$csv->eof or $csv->error_diag();
close $in_fh
    or croak "Cannot close '$file': $OS_ERROR";
(4) Using Parse::CSV
use Parse::CSV;
my $simple = Parse::CSV->new(
    file => $file,
);
my $row_no = 0;
while ( my $array_ref = $simple->fetch ) {
    ++$row_no;
    if ( scalar @{$array_ref} < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
I benchmarked them using the Benchmark module.
use Benchmark qw(timeit timestr timediff :hireswallclock);
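Roughly, each implementation was wrapped in a sub and timed with something like the following (simplified; check_columns_basic is just a placeholder name for one of those subs):
# Placeholder harness: check_columns_basic() stands in for one implementation.
my $t = timeit( 1, sub { check_columns_basic( $file, $min_cols_no ) } );
print 'Implementation 1: ', timestr($t), "\n";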
And these are the numbers (in seconds) that I got:
File with 1,000 lines:
Implementation 1: 0.0016
Implementation 2: 0.0025
Implementation 3: 0.0050
Implementation 4: 0.0097
File with 10,000 lines:
Implementation 1: 0.0204
Implementation 2: 0.0244
Implementation 3: 0.0523
Implementation 4: 0.1050
File with 1,500,000 lines:
Implementation 1: 1.8697
Implementation 2: 3.1913
Implementation 3: 7.8475
Implementation 4: 15.6274
Given these numbers, I would conclude that the simple parser is the fastest, but from what I have read in various sources, Text::CSV_XS should be the fastest.
Can someone enlighten me on this? Is there something wrong with how I am using the modules? Thanks a lot for your help!
There are CSV files
header1,header2,header3
value1,value2,value3
and then there are CSV files.
header1,"This, as they say, is header2","And header3
even contains a newline!"
value1,"value2, 2nd in a series of 3 values",value3
Text::CSV and its ilk have been painstakingly developed and tested to deal with the second kind. If you are confident that your input does and always will conform to the simple CSV specification, then it is very likely that you can build a parser that will outperform Text::CSV.
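As a rough, untested sketch of what that robustness looks like for your task: Text::CSV_XS's getline() returns one logical record at a time, so a quoted field containing commas or an embedded newline still comes back as a single row, which no line-by-line split can manage ($file and $min_cols_no as in your question; binary => 1 allows embedded newlines in quoted fields).
use Text::CSV_XS;
use Carp qw(croak);
use English qw(-no_match_vars);

my $csv = Text::CSV_XS->new( { binary => 1 } )
    or croak "Cannot use CSV: " . Text::CSV_XS->error_diag();
open my $in_fh, '<', $file
    or croak "Cannot open '$file': $OS_ERROR";
my $row_no = 0;
while ( my $fields = $csv->getline($in_fh) ) {
    ++$row_no;    # counts logical records, which may span several physical lines
    if ( scalar @{$fields} < $min_cols_no ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in record '$row_no'.";
    }
}
$csv->eof or $csv->error_diag();
close $in_fh
    or croak "Cannot close '$file': $OS_ERROR";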
Note that your Text::CSV_XS version does more than your simple parser version. It splits the line, puts it into memory, and makes your hashref point to the fields.
It also may have other logic under the hood, like allowing escaped delimiters (I don't know, as I haven't used it). On top of that, there is always a small amount of overhead when using a module: function calls, passing parameters back and forth, and perhaps generic code that doesn't really apply in your case (such as error checking for things you don't care about).
Normally the benefits of using a module greatly outweigh the costs. You get more features, more reliable code, etc. But that might not be true with a small, very simple task. If all you need to do is verify the number of columns, using a module might be overkill. You could make your own implementation even faster by just counting the number of columns, and not bothering to split at all:
my $wanted_commas = $min_cols_no - 1;    # a row with N columns has at least N-1 commas
/(?:,[^,]*){$wanted_commas}/ or croak "Did not find minimum number of columns";
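Dropped into your first loop, that check would look roughly like this (illustrative sketch, same variables as your basic version; it still assumes simple CSV with no quoted or escaped commas):
while ( my $row = <$in_fh> ) {
    ++$row_no;
    # Count delimiters instead of splitting into fields;
    # $wanted_commas is $min_cols_no - 1, as computed above.
    if ( $row !~ /(?:,[^,]*){$wanted_commas}/ ) {
        croak "Invalid file format. File '$file' does not have '$min_cols_no' columns in line '$row_no'.";
    }
}
A plain character count such as ( $row =~ tr/,// ) >= $wanted_commas would be another way to avoid splitting altogether.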
If you are going to do real processing in addition to this verification step, using the module will probably be beneficial.
All CSV parsing modules do the same basic thing: open the file and parse the CSV in some way, much as you did in your basic sub. They just carry more overhead because, internally, they do a lot more than you need (checking for proper CSV format, passing around object structures, etc.). That makes them slower than your basic approach, to varying degrees.
You benchmarked the approaches yourself; isn't the result obvious? If I didn't need the extended functionality of the CSV modules, I would parse a CSV file the basic way myself.
(I don't know whether you could speed them up by improving how you use the modules.)