
Timestamp arithmetic in bash

Say I have two log files (input.log and output.log) with the following format:

2012-01-16T12:00:00 12345678

The first field is the processing timestamp and the second is a unique ID. I'm trying to find:

  1. The records from input.log which don't have a corresponding record for that ID in output.log
  2. The records from input.log which have a record for that ID, but the difference in the timestamps exceeds 5 seconds

I have a workaround solution with MySQL, but I'd ideally like to remove the database component and handle it with a shell script.

I have the following, which returns the lines of input.log with an added column if output.log contains the ID:

join -a1 -j2 -o 0 1.1 2.1 <(sort -k2,2 input.log) <(sort -k2,2 output.log)

Example output:

10111 2012-01-16T10:00:00 2012-01-16T10:00:04
11562 2012-01-16T11:00:00 2012-01-16T11:00:10
97554 2012-01-16T09:00:00
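
As a rough sketch of how that joined output covers the first requirement: a record with no match in output.log has only two fields, so it can be picked out by field count. The helper name unmatched_ids is made up for illustration, and the sample lines are the ones shown above.

```shell
# Sketch only: unmatched_ids is a hypothetical helper, not part of the question.
# It reads the joined output on stdin and prints records without an output.log timestamp.
unmatched_ids() {
  # awk splits on runs of whitespace, so a missing third column leaves NF == 2
  awk 'NF == 2 { print $1, $2 }'
}

printf '%s\n' \
  '10111 2012-01-16T10:00:00 2012-01-16T10:00:04' \
  '97554 2012-01-16T09:00:00 ' | unmatched_ids
```

Printing the two fields explicitly, rather than the whole line, drops the trailing space that join leaves where the third column would be.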

Main question:

Now that I have this information, how can I go about computing the difference between the two timestamps and keeping only the records where they are more than 5 seconds apart? I hit some problems processing the ISO 8601 timestamp with date (specifically the T) and assumed there must be a better way.

Edit: GNU coreutils has supported ISO 8601 timestamps since late 2011, not long after this question was asked. This is likely no longer an issue for anyone. See this answer
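
A minimal sketch of what that looks like, assuming a GNU date recent enough to accept the T separator. The helper name iso_epoch is made up for illustration, and -u is an assumption so that the timestamps are read as UTC and daylight saving changes cannot skew the difference.

```shell
# Sketch only: iso_epoch is a hypothetical helper name.
# Converts an ISO 8601 timestamp to epoch seconds with GNU date (-d is a GNU option).
iso_epoch() {
  date -u -d "$1" +%s
}

start=$(iso_epoch 2012-01-16T11:00:00)
end=$(iso_epoch 2012-01-16T11:00:10)
echo $(( end - start ))   # difference in seconds
```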

Secondary question:

Is there perhaps a way to rework the entire approach, for instance into a single awk script? My knowledge of processing multiple files and setting up the correct inequalities for the output conditions was the limiting factor here, hence the approach above.

asked Jan 17 '12 00:01 by cmbuckley


2 Answers

If you have GNU awk, then you can try something like this:

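gawk '
# First file (output.log): store each timestamp, keyed by ID
NR==FNR{a[$2]=$1;next}
# IDs from input.log that never appear in output.log
!($2 in a) {print $2,$1; next}
# IDs present in both files: convert both timestamps to epoch seconds
($2 in a) {
  cmd = "date +%s -d " $1;    cmd | getline var1; close(cmd)
  cmd = "date +%s -d " a[$2]; cmd | getline var2; close(cmd)
  var3 = var2 - var1;
  # > 4 also reports a gap of exactly 5 seconds; use > 5 to require more than 5
  if (var3 > 4) print $2, $1, a[$2]
}' output.log input.log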

Test:

[jaypal:~/Temp] cat input.log 
2012-01-16T09:00:00 9
2012-01-16T10:00:00 10
2012-01-16T11:00:00 11

[jaypal:~/Temp] cat output.log 
2012-01-16T10:00:04 10
2012-01-16T11:00:10 11
2012-01-16T12:00:00 12

[jaypal:~/Temp] gawk '
NR==FNR{a[$2]=$1;next} 
!($2 in a) {print $2,$1; next} 
($2 in a) {"date +%s -d " $1 | getline var1; "date +%s -d " a[$2] | getline var2;var3=var2-var1;if (var3>4) print $2,$1,a[$2] }' output.log input.log
9 2012-01-16T09:00:00
11 2012-01-16T11:00:00 2012-01-16T11:00:10

Explanation:

  • NR==FNR{a[$2]=$1;next}

We start off by storing the first field of each line of output.log in an array indexed on the second field. NR==FNR is only true while awk is reading the first file named on the command line, so this block slurps output.log completely, and next prevents the other pattern{action} statements from running on those lines.

  • !($2 in a) {print $2,$1; next}

Once output.log has been read, we move on to input.log. For each line we check whether its second field is missing from our array (i.e. from output.log). If it is missing, we print the record and skip to the next line. This repeats for every such record in input.log.

  • ($2 in a) {"date +%s -d " $1 | getline var1; "date +%s -d " a[$2] | getline var2; var3=var2-var1; if (var3 > 4) print $2,$1,a[$2] }

This block handles the IDs that are present in both files, where we need to calculate the difference. We run the external date command to convert each timestamp to epoch seconds. Calling it through awk's system() function would send its output straight to STDOUT, where we could not use it, so instead we pipe the command into awk's getline, which stores the output in a variable (var1 and var2). Once both values are stored, we put their difference in var3, and if var3 is > 4 we print the record in the format you desire.
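
As a rough sketch of the getline technique described above, the date call can be wrapped in an awk function. The names seconds_apart and epoch are made up for illustration. The T is replaced with a space for older versions of date, -u is an assumption so that the timestamps are read as UTC, and GNU date is assumed for the -d option.

```shell
# Sketch only: seconds_apart and epoch are hypothetical names, not from the answer.
# Reads lines of "ID start end" on stdin and prints "ID difference-in-seconds".
seconds_apart() {
  awk '
  function epoch(ts,    cmd, out) {
    gsub(/T/, " ", ts)                  # older GNU date rejects the T separator
    cmd = "date -u +%s -d \"" ts "\""   # assumes GNU date for the -d option
    cmd | getline out                   # capture the first line that date prints
    close(cmd)                          # close the pipe so a repeated timestamp runs date again
    return out
  }
  { print $1, epoch($3) - epoch($2) }
  '
}

echo '11 2012-01-16T11:00:00 2012-01-16T11:00:10' | seconds_apart
```

Closing the command after each read matters when the same timestamp appears more than once: a command that is still open has already reached end of input, so a second getline from it would read nothing and leave the variable holding a stale value.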

answered Sep 21 '22 21:09 by jaypal singh


Here's the solution I went with:

cat input.log
2012-01-16T09:00:00 9
2012-01-16T10:00:00 10
2012-01-16T11:00:00 11

cat output.log
2012-01-16T10:00:04 10
2012-01-16T11:00:10 11
2012-01-16T12:00:00 12

sort -k2,2 input.log > input.sort
sort -k2,2 output.log > output.sort

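join -a1 -j2 -o 0 1.1 2.1 input.sort output.sort | while read -r id i o; do
    if [ -n "$o" ]; then
        # Replace the T with a space so that older versions of date can parse it
        ot=$(date +%s -d "${o/T/ }")
        it=$(date +%s -d "${i/T/ }")
        # [[ ]] evaluates the operands of -lt arithmetically, so $it+5 is a sum
        [[ $it+5 -lt $ot ]] && echo "$id $i $o"
    else
        echo "$id $i"
    fi
done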
11 2012-01-16T11:00:00 2012-01-16T11:00:10
9 2012-01-16T09:00:00

answered Sep 22 '22 21:09 by cmbuckley