GNU Parallel - which job failed?

I'm running a job on several different servers (up to 25) using GNU parallel.

The shell script which implements this currently does:

parallel --tag --nonall -S $some_list_of_servers "some_command"
state=$?
echo -n "RESULT: "
if [ "$state" -eq "0" ]
then
    echo "All jobs successful"
else
    echo "$state jobs failed"
fi
return $state

where some_list_of_servers is an array, and some_command is, for instance, git fetch.

What I want is a lot more information than just how many jobs failed: I want to know which command failed, and on which server.

I've been through the man page, Google, and SO, but can't find the switch(es) I'm looking for.

Any help gratefully appreciated.

WeeDom

EDIT in response to Answer 1:

I tried that, and something odd is happening.

weedom@host1: ~/$ parallel --tag --nonall  -j8 --joblog test.log -S host1,host2 uptime 
host2   10:41:17 up 36 days, 20:45,  1 user,  load average: 0.00, 0.00, 0.00
host1         10:41:17 up 22:34,  3 users,  load average: 0.06, 0.11, 0.04
weedom@host1: ~/$ cat test.log
Seq     Host    Starttime       Runtime Send    Receive Exitval Signal  Command
1       host1        1403689277.067  0.519999980926514       0       0       0      0       uptime

No matter how many hosts I add to -S, only the last one to complete seems to end up in test.log.

I've added a follow-up question here: GNU Parallel - --joblog only logging last job

WeeDom asked Jun 23 '14 15:06



1 Answer

You want the --joblog option, as shown in the docs. GNU Parallel can even rerun just the failed jobs with --resume-failed.

For example, running this script (saved as failjob):

#!/bin/bash
# Fail (exit 1) when the job number is divisible by 3; succeed otherwise.
jobmod=$(( $1 % 3 ))
if [ "$jobmod" -eq 0 ]
then
    exit 1
else
    exit 0
fi

on several hosts like this:

$ seq 1 10 | parallel --joblog out.log -S "srv01,srv02,srv03,srv04" ./failjob 

gives

$ more out.log
Seq Host    Starttime   Runtime Send    Receive Exitval Signal  Command
1   srv01   1403542514.713  0.267   0   0   0   0   ./failjob 1
3   srv02   1403542514.717  0.266   0   0   1   0   ./failjob 3
4   srv03   1403542514.719  0.266   0   0   0   0   ./failjob 4
2   srv04   1403542514.715  0.397   0   0   0   0   ./failjob 2
5   srv01   1403542514.983  0.231   0   0   0   0   ./failjob 5
6   srv02   1403542514.986  0.368   0   0   1   0   ./failjob 6
7   srv03   1403542514.988  0.388   0   0   0   0   ./failjob 7
8   srv04   1403542515.121  0.437   0   0   0   0   ./failjob 8
9   srv01   1403542515.221  0.343   0   0   1   0   ./failjob 9
10  srv02   1403542515.356  0.388   0   0   0   0   ./failjob 10
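
To pull just the failures out of a log like the one above, something along these lines may help. It is a minimal sketch, assuming the joblog is tab-separated with the column order shown (Host in column 2, Exitval in column 7, Signal in column 8, Command in column 9). The function name list_failed_jobs and the sample file are made up for illustration; they are not part of GNU Parallel.

```shell
#!/bin/bash
# list_failed_jobs (hypothetical helper): print host, exit value, signal and
# command for every joblog entry that did not finish cleanly.
# Assumes the tab-separated layout:
#   Seq Host Starttime Runtime Send Receive Exitval Signal Command
list_failed_jobs() {
    awk -F'\t' 'NR > 1 && ($7 != 0 || $8 != 0) {
        printf "host=%s exitval=%s signal=%s command=%s\n", $2, $7, $8, $9
    }' "$1"
}

# Hand-written sample joblog, built from three rows of the output above.
sample_log=$(mktemp)
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    Seq Host Starttime Runtime Send Receive Exitval Signal Command \
    1 srv01 1403542514.713 0.267 0 0 0 0 './failjob 1' \
    3 srv02 1403542514.717 0.266 0 0 1 0 './failjob 3' \
    9 srv01 1403542515.221 0.343 0 0 1 0 './failjob 9' > "$sample_log"

list_failed_jobs "$sample_log"
rm -f "$sample_log"
```

In the script from the question, the same function could be called on the --joblog file in the else branch, so the failure count is followed by one line per failed host and command.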
Jonathan Dursi answered Sep 27 '22 19:09
