 

Amazon EMR - how to set a timeout for a step

Is there a way to set a timeout for a step in Amazon AWS EMR?

I'm running a batch Apache Spark job on EMR and I would like the job to stop with a timeout if it doesn't end within 3 hours.

I cannot find a way to set a timeout in Spark, in Yarn, or in the EMR configuration.

Thanks for your help!

asked Apr 21 '17 by nicola

1 Answer

Although I am obviously quite late to the party, I would like to offer an alternative approach that keeps the timeout/shutdown logic out of the application itself, so the application stays no more complex than needed. Maybe it proves useful for someone in the future.

You can:

  • write a Python script and use it as a wrapper around regular Yarn commands
  • execute those Yarn commands via subprocess lib
  • parse their output according to your will
  • decide which Yarn applications should be killed

More details on this approach follow...

Python wrapper script and running the Yarn commands via subprocess lib

import subprocess

running_apps = subprocess.check_output(['yarn', 'application', '--list', '--appStates', 'RUNNING'], universal_newlines=True)

This snippet would give you an output similar to something like this:

Total number of applications (application-types: [] and states: [RUNNING]):1
                Application-Id      Application-Name                                Application-Type          User       Queue               State         Final-State         Progress                        Tracking-URL
application_1554703852869_0066      HIVE-645b9a64-cb51-471b-9a98-85649ee4b86f       TEZ                       hadoop     default             RUNNING       UNDEFINED           0%                              http://ip-xx-xxx-xxx-xx.eu-west-1.compute.internal:45941/ui/

You can then parse this output (beware: there might be more than one app running) and extract the application-id values.
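As an illustration, the application ids can be pulled out with a small helper. The function name, regex, and sample text here are my own, based on the listing format shown above:

```python
import re

def extract_app_ids(listing):
    """Pull every application_<cluster>_<seq> id out of the
    output of `yarn application --list`."""
    return re.findall(r'application_\d+_\d+', listing)

# Sample trimmed from the listing shown above
sample = (
    "Total number of applications (application-types: [] and states: [RUNNING]):1\n"
    "                Application-Id      Application-Name ...\n"
    "application_1554703852869_0066      HIVE-645b9a64 ... RUNNING ...\n"
)
print(extract_app_ids(sample))  # ['application_1554703852869_0066']
```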

Then, for each of those application ids, you can invoke another yarn command to get more details about the specific application:

app_status_string = subprocess.check_output(['yarn', 'application', '--status', app_id], universal_newlines=True)

Output of this command should be something like this:

Application Report :
  Application-Id : application_1554703852869_0070
  Application-Name : com.organization.YourApp
  Application-Type : HIVE
  User : hadoop
  Queue : default
  Application Priority : 0
  Start-Time : 1554718311926
  Finish-Time : 0
  Progress : 10%
  State : RUNNING
  Final-State : UNDEFINED
  Tracking-URL : http://ip-xx-xxx-xxx-xx.eu-west-1.compute.internal:40817
  RPC Port : 36203
  AM Host : ip-xx-xxx-xxx-xx.eu-west-1.compute.internal
  Aggregate Resource Allocation : 51134436 MB-seconds, 9284 vcore-seconds
  Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
  Log Aggregation Status : NOT_START
  Diagnostics :
  Unmanaged Application : false
  Application Node Label Expression : <Not set>
  AM container Node Label Expression : CORE

From this report you can also extract the application's start time, compare it with the current time, and see how long it has been running. If it has been running for more than some threshold number of minutes, you kill it.
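A sketch of that comparison (the threshold constant and function name are my own, assuming the `Start-Time` field is epoch milliseconds, as in the report above):

```python
import re
import time

THRESHOLD_MINUTES = 180  # the 3-hour limit from the question

def is_over_threshold(status_output, now_ms=None):
    """Return True if the Start-Time in a `yarn application --status`
    report lies more than THRESHOLD_MINUTES in the past."""
    match = re.search(r'Start-Time : (\d+)', status_output)
    if not match:
        return False  # no start time found; leave the app alone
    start_ms = int(match.group(1))
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return (now_ms - start_ms) / 60000.0 > THRESHOLD_MINUTES
```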

How do you kill it? Easy.

kill_output = subprocess.check_output(['yarn', 'application', '--kill', app_id], universal_newlines=True)

This should be it, from the killing of the step/application perspective.
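Putting the pieces together, a minimal monitor could look like the sketch below. The function and variable names are mine; the yarn invocations are the ones shown above, passed through a `run` callable so the logic can be exercised without a live cluster:

```python
import re
import subprocess
import time

THRESHOLD_MINUTES = 180  # kill anything running longer than 3 hours

def _run_yarn(args):
    # Invoke the real yarn CLI, as in the snippets above
    return subprocess.check_output(['yarn'] + args, universal_newlines=True)

def kill_long_running_apps(run=_run_yarn, now_ms=None):
    """List RUNNING Yarn apps, check each one's Start-Time (epoch millis),
    and kill those over the threshold. Returns the killed app ids."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    killed = []
    listing = run(['application', '--list', '--appStates', 'RUNNING'])
    for app_id in re.findall(r'application_\d+_\d+', listing):
        status = run(['application', '--status', app_id])
        match = re.search(r'Start-Time : (\d+)', status)
        if match and (now_ms - int(match.group(1))) / 60000.0 > THRESHOLD_MINUTES:
            run(['application', '--kill', app_id])
            killed.append(app_id)
    return killed
```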

Automating the approach

AWS EMR has a wonderful feature called "bootstrap actions". It runs a set of actions on EMR cluster creation and can be utilized for automating this approach.

Add a bash script to the bootstrap actions which is going to:

  • download the python script you just wrote to the cluster (master node)
  • add the python script to a crontab

That should be it.

P.S. I assumed Python 3 is at our disposal for this purpose.

answered Sep 19 '22 by ezamur