I would like to optimise this Perl sub:
push_csv($string,$addthis,$position);
for placing strings inside a CSV string.
e.g. if $string="one,two,,four"; $addthis="three"; $position=2;
then push_csv($string,$addthis,$position)
will change the value of $string = "one,two,three,four";
sub push_csv {
my @fields = split /,/, $_[0]; # split original string by commas;
$_[1] =~ s/,//g; # remove commas in $addthis
$fields[$_[2]] = $_[1]; # put the $addthis string into
# the array position $position.
$_[0] = join ",", @fields; # join the array with commas back
# into the string.
}
This is a bottleneck in my code, as it needs to be called a few million times.
If you are proficient in Perl, could you take a look at it, and suggest optimisation/alternatives? Thanks in advance! :)
EDIT: Converting to @fields and back to string is taking time, I just thought of a way to speed it up where I have more than one sub call in a row. Split once, then push more than one thing into the array, then join once at the end.
For several reasons, you should be using Text::CSV to handle these low-level CSV details. Provided that you are able to install the XS version, my understanding is that it will run faster than anything you can do in pure Perl. In addition, the module will correctly handle all sorts of edge cases that you are likely to miss.
use Text::CSV;
my $csv = Text::CSV->new;
my $line = 'foo,,fubb';
$csv->parse($line);
my @fields = $csv->fields;
$fields[1] = 'bar';
$csv->combine(@fields);
print $csv->string; # foo,bar,fubb
Keep your array as an array in the first place, not as a ,-separated string?
You might want to have a look at Data::Locations.
Or try (untested, unbenchmarked, doesn't append new fields like your original can...)
sub push_csv {
$_[1] =~ y/,//d;
$_[0] =~ s/^(?:[^,]*,){$_[2]}\K[^,]*/$_[1]/;
return;
}
A few suggestions:
tr/,//d
instead of s/,//g
as it is faster. This is essentially the same as ysth's suggestion to use y/,//d
split
only as much as is needed. If $position = 1
, and you have 10 fields, then you're wasting computation performing unnecessary splits and joins. The optional third argument to split
can be leveraged to your advantage here. However, this does depend on how many consecutive empty fields you are expecting. It may not be worth it if you don't know ahead of time how many of these you have
You really ought to be using Text::CSV
, but here's how I would revise the implementation of your sub in pure Perl (assuming a maximum of one consecutive empty field):
sub push_csv {
my ( $items, $positions ) = @_[1..2];
# Test inputs
warn "No. of items to add & positions not equal"
and
return unless @{$items} == @{$positions};
my $maxPos; # Find the maximum position number
for my $position ( @{$positions} ) {
$maxPos ||= $position;
$maxPos = $position if $maxPos < $position;
}
my @fields = split /,/ , $_[0], $maxPos+2; # Split only as much as needed
splice ( @fields, $positions->[$_], 1, $items->[$_] ) for 0 .. $#{$items};
$_[0] = join ',' , @fields;
print $_[0],"\n";
}
use strict;
use warnings;
my $csvString = 'one,two,,four,,six';
my @missing = ( 'three', 'five' );
my @positions = ( 2, 4 );
push_csv ( $csvString, \@missing, \@positions );
print $csvString; # Prints 'one,two,three,four,five,six'
If you're hitting a bottleneck by split
ting and join
ing a few million times... then don't split
and join
constantly. split
each line once when it initially enters the system, pass that array (or, more likely, a reference to the array) around while doing your processing, and then do a single join
to turn it into a string when you're ready for it to leave the system.
e.g.:
#!/usr/bin/env perl
use strict;
use warnings;
# Start with some CSV-ish data
my $initial_data = 'foo,bar,baz';
# Split it into an arrayref
my $data = [ split /,/, $initial_data ];
for (1 .. 1_000_000) {
# Pointless call to push_csv, just to make it run
push_csv($data, $_, $_ % 3);
}
# Turn it back into a string and display it
my $final_data = join ',', @$data;
print "Result: $final_data\n";
sub push_csv {
my ($data_ref, $value, $position) = @_;
$$data_ref[$position] = $value;
# Alternately:
# $data_ref->[$position] = $value;
}
Note that this simplifies things enough that push_csv
becomes a single, rather simple, line of processing, so you may want to just do the change inline instead of calling a sub for it, especially if runtime efficiency is a key criterion - in this trivial example, getting rid of push_csv
and doing it inline reduced run time by about 70% (from 0.515s to 0.167s).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With