Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Perl sub optimisation push string into csv using split

Tags:

perl

I would like to optimise this Perl sub:

push_csv($string,$addthis,$position);

for placing strings inside a CSV string.

e.g. if $string="one,two,,four"; $addthis="three"; $position=2;
then push_csv($string,$addthis,$position) will change the value of $string = "one,two,three,four";

sub push_csv {

    my @fields = split /,/, $_[0]; # split original string by commas;
    $_[1] =~ s/,//g;               # remove commas in $addthis
    $fields[$_[2]] = $_[1];        # put the $addthis string into
                                   # the array position $position.
    $_[0] = join ",", @fields;     # join the array with commas back
                                   # into the string.
}

This is a bottleneck in my code, as it needs to be called a few million times.

If you are proficient in Perl, could you take a look at it, and suggest optimisation/alternatives? Thanks in advance! :)


EDIT: Converting to @fields and back to string is taking time, I just thought of a way to speed it up where I have more than one sub call in a row. Split once, then push more than one thing into the array, then join once at the end.

like image 372
Dave Avatar asked Jun 25 '10 23:06

Dave


4 Answers

For several reasons, you should be using Text::CSV to handle these low-level CSV details. Provided that you are able to install the XS version, my understanding is that it will run faster than anything you can do in pure Perl. In addition, the module will correctly handle all sorts of edge cases that you are likely to miss.

use Text::CSV;
my $csv = Text::CSV->new;

my $line = 'foo,,fubb';
$csv->parse($line);

my @fields = $csv->fields;
$fields[1] = 'bar';

$csv->combine(@fields); 
print $csv->string;      # foo,bar,fubb
like image 184
FMc Avatar answered Nov 19 '22 18:11

FMc


Keep your array as an array in the first place, not as a ,-separated string?

You might want to have a look at Data::Locations.

Or try (untested, unbenchmarked, doesn't append new fields like your original can...)

sub push_csv {
    $_[1] =~ y/,//d;
    $_[0] =~ s/^(?:[^,]*,){$_[2]}\K[^,]*/$_[1]/;
    return;
}
like image 2
ysth Avatar answered Nov 19 '22 18:11

ysth


A few suggestions:

  • Use tr/,//d instead of s/,//g as it is faster. This is essentially the same as ysth's suggestion to use y/,//d
  • Perform split only as much as is needed. If $position = 1, and you have 10 fields, then you're wasting computation performing unnecessary splits and joins. The optional third argument to split can be leveraged to your advantage here. However, this does depend on how many consecutive empty fields you are expecting. It may not be worth it if you don't know ahead of time how many of these you have
  • You're quite right in wanting to perform multiple appends with one sub-call. There is no need to perform multiple splits and joins when one will do just as well

You really ought to be using Text::CSV, but here's how I would revise the implementation of your sub in pure Perl (assuming a maximum of one consecutive empty field):

sub push_csv {

    my ( $items, $positions ) = @_[1..2];

    # Test inputs
    warn "No. of items to add & positions not equal"
      and
        return unless @{$items} == @{$positions};

    my $maxPos;  # Find the maximum position number

    for my $position ( @{$positions} ) {

        $maxPos ||= $position;
        $maxPos = $position if $maxPos < $position;
    }

    my @fields = split /,/ , $_[0], $maxPos+2;  # Split only as much as needed

    splice ( @fields, $positions->[$_], 1, $items->[$_] ) for 0 .. $#{$items};
    $_[0] = join ',' , @fields;
    print $_[0],"\n";
}

Usage

use strict;
use warnings;

my $csvString = 'one,two,,four,,six';
my @missing = ( 'three', 'five' );
my @positions = ( 2, 4 );

push_csv ( $csvString, \@missing, \@positions );
print $csvString;   # Prints 'one,two,three,four,five,six'
like image 1
Zaid Avatar answered Nov 19 '22 18:11

Zaid


If you're hitting a bottleneck by splitting and joining a few million times... then don't split and join constantly. split each line once when it initially enters the system, pass that array (or, more likely, a reference to the array) around while doing your processing, and then do a single join to turn it into a string when you're ready for it to leave the system.

e.g.:

#!/usr/bin/env perl

use strict;
use warnings;

# Start with some CSV-ish data
my $initial_data = 'foo,bar,baz';

# Split it into an arrayref
my $data = [ split /,/, $initial_data ];

for (1 .. 1_000_000) {
  # Pointless call to push_csv, just to make it run
  push_csv($data, $_, $_ % 3);
}

# Turn it back into a string and display it
my $final_data = join ',', @$data;
print "Result: $final_data\n";

sub push_csv {
  my ($data_ref, $value, $position) = @_;
  $$data_ref[$position] = $value;
  # Alternately:
  # $data_ref->[$position] = $value;
}

Note that this simplifies things enough that push_csv becomes a single, rather simple, line of processing, so you may want to just do the change inline instead of calling a sub for it, especially if runtime efficiency is a key criterion - in this trivial example, getting rid of push_csv and doing it inline reduced run time by about 70% (from 0.515s to 0.167s).

like image 1
Dave Sherohman Avatar answered Nov 19 '22 17:11

Dave Sherohman