I am using the Slurm job scheduler to run my jobs on a cluster. What is the most efficient way to submit the Slurm jobs and check on their status using Apache Airflow?
I was able to use an SSHOperator to submit my jobs remotely and check on their status every minute until they complete, but I wonder if anyone knows a better way. Below is the SSHOperator task I wrote.
# Airflow 2.x provider imports; on Airflow 1.10 use the airflow.contrib.* paths instead
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.providers.ssh.operators.ssh import SSHOperator

sshHook = SSHHook(ssh_conn_id='my_conn_id', keepalive_interval=240)

task_ssh_bash = """
cd ~/projects &&
JID=$(sbatch myjob.sh)   # sbatch prints "Submitted batch job <id>"
echo $JID
sleep 10s  # give sacct time to register the job
ST="PENDING"
while [ "$ST" != "COMPLETED" ]; do
    ST=$(sacct -j ${JID##* } -o State | awk 'FNR == 3 {print $1}')
    sleep 1m
    if [ "$ST" == "FAILED" ]; then
        echo "Job final status: $ST, exiting..."
        exit 122
    fi
    echo $ST
done
"""

task_ssh = SSHOperator(
    task_id='test_ssh_operator',
    ssh_hook=sshHook,
    do_xcom_push=True,
    command=task_ssh_bash,
    dag=dag)
I can't give a demonstrable example, but my inclination would be to implement an Airflow sensor on top of something like pyslurm. Funnily enough, I came across your question while looking to see whether anyone had already done this!
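Something along these lines might work: a minimal sketch of a custom sensor, assuming an Airflow 2.x setup with the SSH provider installed. It polls sacct over SSH rather than calling pyslurm directly (pyslurm has to be built against your site's Slurm version), and the class name SlurmJobSensor plus the submit_task_id parameter are purely illustrative, not part of any library.

from airflow.exceptions import AirflowFailException
from airflow.providers.ssh.hooks.ssh import SSHHook
from airflow.sensors.base import BaseSensorOperator


class SlurmJobSensor(BaseSensorOperator):
    """Poll sacct over SSH until the Slurm job reaches a terminal state."""

    def __init__(self, *, ssh_conn_id, submit_task_id, **kwargs):
        super().__init__(**kwargs)
        self.ssh_conn_id = ssh_conn_id
        # task that submitted the job and pushed the numeric job id to XCom
        self.submit_task_id = submit_task_id

    def poke(self, context):
        job_id = context["ti"].xcom_pull(task_ids=self.submit_task_id)
        hook = SSHHook(ssh_conn_id=self.ssh_conn_id)
        with hook.get_conn() as client:
            # -X: allocation only (no job steps), -n: suppress the header
            _, stdout, _ = client.exec_command(f"sacct -j {job_id} -X -n -o State")
            out = stdout.read().decode().strip()
        # sacct can lag right after submission; treat empty output as pending
        state = out.split()[0] if out else "PENDING"
        self.log.info("Slurm job %s is %s", job_id, state)
        if state in ("FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL"):
            raise AirflowFailException(f"Slurm job {job_id} ended as {state}")
        return state == "COMPLETED"

The submitting task could be an SSHOperator running something like sbatch --parsable myjob.sh (which prints only the job id) with do_xcom_push=True, though depending on your Airflow version you may need to decode the pushed output. Running the sensor with mode="reschedule" and poke_interval=60 means it doesn't tie up a worker slot between checks, unlike a sleep loop inside a single SSH command.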
EDIT: There is also an interesting topic regarding the use of executors for submitting jobs.
Best of luck