I have a bash script that cuts a section out of a logfile between two timestamps, but because of the size of the files, it takes quite a while to run.
If I were to rewrite the script in Perl, could I achieve a significant speed increase, or would I have to move to something like C to accomplish this?
#!/bin/bash
if [ $# -ne 3 ]; then
    echo "USAGE $0 <logfile(s)> <from date (epoch)> <to date (epoch)>"
    exit 1
fi
LOGFILES=$1
FROM=$2
TO=$3
rm -f /tmp/getlogs??????
TEMP=`mktemp /tmp/getlogsXXXXXX`
## LOGS NEED TO BE LISTED CHRONOLOGICALLY
ls -lnt $LOGFILES|awk '{print $8}' > $TEMP
LOGFILES=`tac $TEMP`
cp /dev/null $TEMP
findEntry() {
    RETURN=0
    dt=$1
    fil=$2
    ln1=$3
    ln2=$4
    t1=`tail -n+$ln1 $fil|head -n1|cut -c1-15`
    dt1=`date -d "$t1" +%s`
    t2=`tail -n+$ln2 $fil|head -n1|cut -c1-15`
    dt2=`date -d "$t2" +%s`
    if [ $dt -ge $dt2 ]; then
        # target time is at or past the last line
        mid=$ln2
    else
        mid=$(( (($ln2-$ln1)*($dt-$dt1)/($dt2-$dt1))+$ln1 ))
    fi
    t3=`tail -n+$mid $fil|head -n1|cut -c1-15`
    dt3=`date -d "$t3" +%s`
    # finished
    if [ $dt -eq $dt3 ]; then
        # FOUND IT (scroll back to the first match)
        while [ $dt -eq $dt3 ]; do
            mid=$(( $mid-1 ))
            t3=`tail -n+$mid $fil|head -n1|cut -c1-15`
            dt3=`date -d "$t3" +%s`
        done
        RETURN=$(( $mid+1 ))
        return
    fi
    if [ $(( $mid-1 )) -eq $ln1 ] || [ $(( $ln2-1 )) -eq $mid ]; then
        # FOUND NEAR IT
        RETURN=$mid
        return
    fi
    # not finished yet
    if [ $dt -lt $dt3 ]; then
        # too high
        findEntry $dt $fil $ln1 $mid
    elif [ $dt -ge $dt3 ]; then
        # too low
        findEntry $dt $fil $mid $ln2
    fi
}
# Check timestamps on logfiles
LOGS=""
for LOG in $LOGFILES; do
    filetime=`ls -ln $LOG|awk '{print $6,$7}'`
    timestamp=`date -d "$filetime" +%s`
    if [ $timestamp -ge $FROM ]; then
        LOGS="$LOGS $LOG"
    fi
done
# Check first and last dates in LOGS to refine further
for LOG in $LOGS; do
    if [ ${LOG%.gz} != $LOG ]; then
        gunzip -c $LOG > $TEMP
    else
        cp $LOG $TEMP
    fi
    t=`head -n1 $TEMP|cut -c1-15`
    FIRST=`date -d "$t" +%s`
    t=`tail -n1 $TEMP|cut -c1-15`
    LAST=`date -d "$t" +%s`
    if [ $TO -lt $FIRST ] || [ $FROM -gt $LAST ]; then
        # This file is entirely out of range
        cp /dev/null $TEMP
    elif [ $FROM -le $FIRST ]; then
        if [ $TO -ge $LAST ]; then
            # Entire file is within range
            cat $TEMP
        else
            # Last part of file is out of range
            STARTLINENUMBER=1
            ENDLINENUMBER=`wc -l < $TEMP`
            findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
            head -n$RETURN $TEMP
        fi
    else
        if [ $TO -ge $LAST ]; then
            # First part of file is out of range
            STARTLINENUMBER=1
            ENDLINENUMBER=`wc -l < $TEMP`
            findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
            tail -n+$RETURN $TEMP
        else
            # Range is entirely within this logfile
            STARTLINENUMBER=1
            ENDLINENUMBER=`wc -l < $TEMP`
            findEntry $FROM $TEMP $STARTLINENUMBER $ENDLINENUMBER
            n1=$RETURN
            findEntry $TO $TEMP $STARTLINENUMBER $ENDLINENUMBER
            n2=$RETURN
            tail -n+$n1 $TEMP|head -n$(( $n2-$n1 ))
        fi
    fi
done
rm -f /tmp/getlogs??????
Perl scripts are usually (if not always) faster than bash. Bash isn't a language so much as a command interpreter that's been hacked to death to allow for things that make it look like a scripting language.
Shell scripting is handier when you only need to type a few lines and your script mostly uses the shell's built-in commands. When you need to call external programs, use Perl. Perl is targeted at text processing, which makes it more powerful than shell scripts in that regard.
Bash is a general-purpose scripting language, just like Python, Ruby, and Perl; each simply has different strengths than the others.
Some other shells are 2–30 times faster than bash, depending on the test.
Perl is absurdly faster than Bash. And, for text manipulation, you can actually achieve better performance with Perl than with C, unless you take the time to write complex algorithms. Of course, for simple stuff C can be unbeatable.
That said, if your "bash" script is not looping, just calling other programs, then there isn't any gain to be had. For example, if your script looks like "cat X | grep Y | cut -f 3-5 | sort | uniq", then most of the time is spent in cat, grep, cut, sort, and uniq, NOT in Bash.
You'll gain performance if there is any loop in the script, or if you save multiple reads of the same file.
You say you cut stuff between two timestamps on a file. Let's say your Bash script looks like this:
LINE1=`grep -n TIMESTAMP1 filename | head -1 | cut -d ':' -f 1`
LINE2=`grep -n TIMESTAMP2 filename | head -1 | cut -d ':' -f 1`
tail -n +$LINE1 filename | head -n $(($LINE2-$LINE1))
Then you'll gain performance, because you are reading the whole file three times: once for each command where "filename" appears. In Perl, you would do something like this:
my $state = 0;
while (<>) {
    exit if /TIMESTAMP2/;
    print $_ if $state == 1;
    $state = 1 if /TIMESTAMP1/;
}
This reads the file only once and stops as soon as it sees TIMESTAMP2. Since you are processing multiple files, you don't want a bare exit (and Perl has no break); instead, skip ahead to the next file so the script can continue processing the remaining ones.
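A minimal sketch of that multi-file variant (untested; TIMESTAMP1 and TIMESTAMP2 are placeholders for whatever patterns bound your range):
#!/usr/bin/perl
use strict;
use warnings;

my $in_range = 0;
while (my $line = <>) {
    if ($line =~ /TIMESTAMP2/) {    # placeholder end pattern
        close ARGV;                 # done with this file; <> moves on to the next
        $in_range = 0;
        next;
    }
    print $line if $in_range;
    $in_range = 1 if $line =~ /TIMESTAMP1/;    # placeholder start pattern
    $in_range = 0 if eof;           # reset the flag at each file boundary
}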
Anyway, seeing your script, I'm positive you'll gain a lot by rewriting it in Perl. Setting aside the loops dealing with file names (whose speed WILL improve, but is probably insignificant), for each file that is not entirely inside or outside the range you read the whole file just to count its lines (wc -l), and then spawn a fresh tail | head | cut | date pipeline for every probe of the binary search.
Furthermore, head your tails: each tail -n+N $fil | head -n1 pipeline re-reads the file from the beginning just to fetch a single line, so some of those lines are being read 10 times or more!
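If you really do need "line N of a file" in Perl, a single sequential pass is enough. This helper is purely illustrative (line_at is a made-up name, not part of the original script):
sub line_at {
    my ($file, $n) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    while (my $line = <$fh>) {
        return $line if $. == $n;    # $. holds the current line number
    }
    return;                          # the file has fewer than $n lines
}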
Updated script based on Brent's comment: This one is untested.
#!/usr/bin/perl

use strict;
use warnings;

my %months = (
    jan => 1, feb => 2, mar => 3, apr => 4,
    may => 5, jun => 6, jul => 7, aug => 8,
    sep => 9, oct => 10, nov => 11, dec => 12,
);

while ( my $line = <> ) {
    my $ts = substr $line, 0, 15;
    next if parse_date($ts) lt '0201100543';
    last if parse_date($ts) gt '0715123456';
    print $line;
}

sub parse_date {
    my ($month, $day, $time) = split ' ', $_[0];
    my ($hour, $min, $sec) = split /:/, $time;
    return sprintf(
        '%2.2d%2.2d%2.2d%2.2d%2.2d',
        $months{lc $month}, $day,
        $hour, $min, $sec,
    );
}

__END__
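If you save the updated script as, say, getlogs.pl (a name chosen here for illustration), you can feed it the chronologically ordered logs, letting gzip unpack any rotated files on the way in:
gzip -cdf old.log.2.gz old.log.1 current.log | perl getlogs.pl
(gzip -cdf decompresses to stdout and, thanks to -f, passes uncompressed files through unchanged.)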
Previous answer for reference: What is the format of the file? Here is a short script which assumes the first column is a timestamp and prints only lines that have timestamps in a certain range. It also assumes that the timestamps are sorted. On my system, it took about a second to filter 900,000 lines out of a million:
#!/usr/bin/perl
use strict;
use warnings;
while ( <> ) {
    my ($ts) = split;
    next if $ts < 1247672719;
    last if $ts > 1252172093;
    print;    # print the whole line, not just the timestamp
}
__END__