I have a bash script that cuts a section out of a logfile between two timestamps, but because of the size of the files, it takes quite a while to run.
If I were to rewrite the script in Perl, could I achieve a significant speed increase, or would I have to move to something like C to accomplish this?
#!/bin/bash
if [ $# -ne 3 ]; then
    echo "USAGE $0 <logfile(s)> <from date (epoch)> <to date (epoch)>"
    exit 1
fi
LOGFILES=$1
FROM=$2
TO=$3
rm -f /tmp/getlogs??????
TEMP=`mktemp /tmp/getlogsXXXXXX`
## LOGS NEED TO BE LISTED CHRONOLOGICALLY
ls -lnt $LOGFILES|awk '{print $8}' > $TEMP
LOGFILES=`tac $TEMP`
cp /dev/null $TEMP
findEntry() {
    RETURN=0
    dt=$1
    fil=$2
    ln1=$3
    ln2=$4
    t1=`tail -n+$ln1 $fil|head -n1|cut -c1-15`
    dt1=`date -d "$t1" +%s`
    t2=`tail -n+$ln2 $fil|head -n1|cut -c1-15`
    dt2=`date -d "$t2" +%s`
    if [ $dt -ge $dt2 ]; then
        # target time is at or past the last line
        mid=$ln2
    else
        mid=$(( (($ln2-$ln1)*($dt-$dt1)/($dt2-$dt1))+$ln1 ))
    fi
    t3=`tail -n+$mid $fil|head -n1|cut -c1-15`
    dt3=`date -d "$t3" +%s`
    # finished
    if [ $dt -eq $dt3 ]; then
        # FOUND IT (scroll back to the first match)
        while [ $dt -eq $dt3 ]; do
            mid=$(( $mid-1 ))
            t3=`tail -n+$mid $fil|head -n1|cut -c1-15`
            dt3=`date -d "$t3" +%s`
        done
        RETURN=$(( $mid+1 ))
        return
    fi
    if [ $(( $mid-1 )) -eq $ln1 ] || [ $(( $ln2-1 )) -eq $mid ]; then
        # FOUND NEAR IT
        RETURN=$mid
        return
    fi
    # not finished yet
    if [ $dt -lt $dt3 ]; then
        # too high
        findEntry $dt $fil $ln1 $mid
    elif [ $dt -ge $dt3 ]; then
        # too low
        findEntry $dt $fil $mid $ln2
    fi
}
# Check timestamps on logfiles
LOGS=""
for LOG in $LOGFILES; do
    filetime=`ls -ln $LOG|awk '{print $6,$7}'`
    timestamp=`date -d "$filetime" +%s`
    if [ $timestamp -ge $FROM ]; then
        LOGS="$LOGS $LOG"
    fi
done
# Check first and last dates in LOGS to refine further
for LOG in $LOGS; do
    if [ ${LOG%.gz} != $LOG ]; then
        gunzip -c $LOG > $TEMP
    else
        cp $LOG $TEMP
    fi
    t=`head -n1 $TEMP|cut -c1-15`
    FIRST=`date -d "$t" +%s`
    t=`tail -n1 $TEMP|cut -c1-15`
    LAST=`date -d "$t" +%s`
    if [ $TO -lt $FIRST ] || [ $FROM -gt $LAST ]; then
        # This file is entirely out of range
        cp /dev/null $TEMP
    elif [ $FROM -le $FIRST ]; then
        if [ $TO -ge $LAST ]; then
            # Entire file is within range
            cat $TEMP
        else
            # Last part of file is out of range
            STARTLINENUMBER=1
            ENDLINENUMBER=`wc -l < $TEMP`
            findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
            head -n$RETURN $TEMP
        fi
    else
        if [ $TO -ge $LAST ]; then
            # First part of file is out of range
            STARTLINENUMBER=1
            ENDLINENUMBER=`wc -l < $TEMP`
            findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
            tail -n+$RETURN $TEMP
        else
            # Range is entirely within this logfile
            STARTLINENUMBER=1
            ENDLINENUMBER=`wc -l < $TEMP`
            findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
            n1=$RETURN
            findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
            n2=$RETURN
            tail -n+$n1 $TEMP|head -n$(( $n2-$n1 ))
        fi
    fi
done
rm -f /tmp/getlogs??????
Perl scripts are usually (if not always) faster than bash. Bash isn't a language so much as a command interpreter that's been hacked to death to allow for things that make it look like a scripting language.
Shell scripting is handier when you only need to type a few lines and your script mostly uses the shell's built-in commands. When you need to call external programs, use Perl. Perl is targeted at text processing, which makes it more powerful than shell scripts in that regard.
Bash is a general-purpose scripting language, just like Python, Ruby, and Perl; each simply has different strengths than the others.
Some other shells are 2–30 times faster than bash, depending on the test.
Perl is absurdly faster than Bash. And, for text manipulation, you can actually achieve better performance with Perl than with C, unless you take the time to write complex algorithms. Of course, for simple stuff C can be unbeatable.
That said, if your "bash" script is not looping, just calling other programs, then there isn't any gain to be had. For example, if your script looks like "cat X | grep Y | cut -f 3-5 | sort | uniq", then most of the time is spent in cat, grep, cut, sort, and uniq, NOT in Bash.
You'll gain performance if there is any loop in the script, or if you save multiple reads of the same file.
You say you cut stuff between two timestamps on a file. Let's say your Bash script looks like this:
LINE1=`grep -n TIMESTAMP1 filename | head -1 | cut -d ':' -f 1`
LINE2=`grep -n TIMESTAMP2 filename | head -1 | cut -d ':' -f 1`
tail -n +$LINE1 filename | head -n $(($LINE2-$LINE1))
Then you'll gain performance, because you are reading the whole file three times: once for each command where "filename" appears. In Perl, you would do something like this:
my $state = 0;
while (<>) {
    exit if /TIMESTAMP2/;
    print $_ if $state == 1;
    $state = 1 if /TIMESTAMP1/;
}
This reads the file only once and stops as soon as it sees TIMESTAMP2. Since you are processing multiple files, you don't want a bare exit (and Perl has no break); instead, skip ahead to the next file so the script can continue processing the remaining ones.
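A minimal sketch of that multi-file variant (untested; TIMESTAMP1 and TIMESTAMP2 are placeholders for whatever patterns bound your range):
#!/usr/bin/perl
use strict;
use warnings;

my $in_range = 0;
while (my $line = <>) {
    if ($line =~ /TIMESTAMP2/) {    # placeholder end pattern
        close ARGV;                 # done with this file; <> moves on to the next
        $in_range = 0;
        next;
    }
    print $line if $in_range;
    $in_range = 1 if $line =~ /TIMESTAMP1/;    # placeholder start pattern
    $in_range = 0 if eof;           # reset the flag at each file boundary
}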
Anyway, seeing your script, I'm positive you'll gain a lot by rewriting it in Perl. Setting aside the loops dealing with file names (whose speed WILL improve, but is probably insignificant), for each file that is not entirely inside or outside the range you read the whole file just to count its lines (wc -l), and then spawn a fresh tail | head | cut | date pipeline for every probe of the binary search.
Furthermore, head your tails: each tail -n+N $fil | head -n1 pipeline re-reads the file from the beginning just to fetch a single line, so some of those lines are being read 10 times or more!
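If you really do need "line N of a file" in Perl, a single sequential pass is enough. This helper is purely illustrative (line_at is a made-up name, not part of the original script):
sub line_at {
    my ($file, $n) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        return $line if $. == $n;    # $. holds the current line number
    }
    return;                          # the file has fewer than $n lines
}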
Updated script based on Brent's comment: This one is untested.
#!/usr/bin/perl

use strict;
use warnings;

my %months = (
    jan => 1, feb => 2, mar => 3, apr => 4,
    may => 5, jun => 6, jul => 7, aug => 8,
    sep => 9, oct => 10, nov => 11, dec => 12,
);

while ( my $line = <> ) {
    my $ts = substr $line, 0, 15;
    next if parse_date($ts) lt '0201100543';
    last if parse_date($ts) gt '0715123456';
    print $line;
}

sub parse_date {
    my ($month, $day, $time) = split ' ', $_[0];
    my ($hour, $min, $sec) = split /:/, $time;
    return sprintf(
        '%2.2d%2.2d%2.2d%2.2d%2.2d',
        $months{lc $month}, $day,
        $hour, $min, $sec,
    );
}

__END__
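If you save the updated script as, say, getlogs.pl (a name chosen here for illustration), you can feed it the chronologically ordered logs, letting gzip unpack any rotated files on the way in:
gzip -cdf old.log.2.gz old.log.1 current.log | perl getlogs.pl
(gzip -cdf decompresses to stdout and, thanks to -f, passes uncompressed files through unchanged.)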
Previous answer for reference: What is the format of the file? Here is a short script which assumes the first column is a timestamp and prints only lines that have timestamps in a certain range. It also assumes that the timestamps are sorted. On my system, it took about a second to filter 900,000 lines out of a million:
#!/usr/bin/perl
use strict;
use warnings;
while ( <> ) {
    my ($ts) = split;
    next if $ts < 1247672719;
    last if $ts > 1252172093;
    print;    # print the whole line, not just the timestamp
}
__END__