Is there a way to set a timeout for a step in AWS EMR?
I'm running a batch Apache Spark job on EMR and I would like the job to stop with a timeout if it doesn't end within 3 hours.
I cannot find a way to set a timeout in Spark, in YARN, or in the EMR configuration.
Thanks for your help!
I would like to offer an alternative approach that does not require any timeout/shutdown logic in the application itself, which would make it more complex than needed. I am obviously quite late to the party, but maybe it proves useful for someone in the future.
You can:
More details about what I am talking about follow...
Python wrapper script and running the Yarn commands via subprocess lib
import subprocess
running_apps = subprocess.check_output(['yarn', 'application', '--list', '--appStates', 'RUNNING'], universal_newlines=True)
This snippet gives you output similar to this:
Total number of applications (application-types: [] and states: [RUNNING]):1
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1554703852869_0066 HIVE-645b9a64-cb51-471b-9a98-85649ee4b86f TEZ hadoop default RUNNING UNDEFINED 0% http://ip-xx-xxx-xxx-xx.eu-west-1.compute.internal:45941/ui/
You can then parse this output (beware, there might be more than one app running) and extract the application-id values.
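One way to do that parsing is sketched below. The function name `extract_app_ids` and the regular expression are my own illustration, and the sketch assumes every application line starts with an id of the form `application_<digits>_<digits>`, as in the sample output above.

```python
import re

# Assumption: application ids look like application_<digits>_<digits>
# and appear at the start of a line (optionally after spaces or tabs).
APP_ID_PATTERN = re.compile(r"^[ \t]*(application_\d+_\d+)\b", re.MULTILINE)


def extract_app_ids(list_output):
    """Return all application ids found in `yarn application --list` output."""
    return APP_ID_PATTERN.findall(list_output)


# Sample modelled on the output shown above (columns separated by tabs).
sample_output = (
    "Total number of applications (application-types: [] and states: [RUNNING]):1\n"
    "Application-Id\tApplication-Name\tApplication-Type\tUser\tQueue\tState\n"
    "application_1554703852869_0066\tHIVE-645b9a64\tTEZ\thadoop\tdefault\tRUNNING\n"
)

print(extract_app_ids(sample_output))  # ['application_1554703852869_0066']
```

The header and summary lines are skipped because neither starts with a lowercase `application_` prefix.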
Then, for each of those application ids, you can invoke another yarn command to get more details about the specific application:
app_status_string = subprocess.check_output(['yarn', 'application', '--status', app_id], universal_newlines=True)
Output of this command should be something like this:
Application Report :
Application-Id : application_1554703852869_0070
Application-Name : com.organization.YourApp
Application-Type : HIVE
User : hadoop
Queue : default
Application Priority : 0
Start-Time : 1554718311926
Finish-Time : 0
Progress : 10%
State : RUNNING
Final-State : UNDEFINED
Tracking-URL : http://ip-xx-xxx-xxx-xx.eu-west-1.compute.internal:40817
RPC Port : 36203
AM Host : ip-xx-xxx-xxx-xx.eu-west-1.compute.internal
Aggregate Resource Allocation : 51134436 MB-seconds, 9284 vcore-seconds
Aggregate Resource Preempted : 0 MB-seconds, 0 vcore-seconds
Log Aggregation Status : NOT_START
Diagnostics :
Unmanaged Application : false
Application Node Label Expression : <Not set>
AM container Node Label Expression : CORE
With this output you can extract the application's start time, compare it with the current time, and see how long the application has been running. If it has been running for longer than some threshold number of minutes, you kill it.
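A minimal sketch of that check follows. The helper names `parse_start_time_ms` and `exceeds_timeout` are hypothetical, and the code assumes `Start-Time` is reported in milliseconds since the Unix epoch, which is what the sample value above (1554718311926) looks like.

```python
import re
import time


def parse_start_time_ms(status_output):
    """Extract the Start-Time field from `yarn application --status` output."""
    match = re.search(r"^\s*Start-Time\s*:\s*(\d+)\s*$", status_output, re.MULTILINE)
    if match is None:
        raise ValueError("no Start-Time field found in status output")
    return int(match.group(1))


def exceeds_timeout(status_output, timeout_minutes, now_ms=None):
    """Return True if the application has run longer than timeout_minutes."""
    if now_ms is None:
        # Assumption: Start-Time is epoch milliseconds, so compare in the same unit.
        now_ms = int(time.time() * 1000)
    elapsed_minutes = (now_ms - parse_start_time_ms(status_output)) / 60000
    return elapsed_minutes > timeout_minutes
```

For the 3-hour limit from the question, you would call `exceeds_timeout(app_status_string, 180)`. The optional `now_ms` parameter exists only so the check can be exercised with a fixed clock.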
How do you kill it? Easy.
kill_output = subprocess.check_output(['yarn', 'application', '--kill', app_id], universal_newlines=True)
That should be all that is needed to kill the step/application.
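Put together, the three yarn commands could be combined into one pass roughly like this. Treat it as a sketch under assumptions, not a definitive implementation: `run_yarn` and `kill_overdue_apps` are names I made up, the parsing relies on the output formats shown above, and `Start-Time` is assumed to be in epoch milliseconds. Note that it kills every RUNNING application past the limit, not only your own step.

```python
import re
import subprocess
import time


def run_yarn(args):
    """Run a yarn CLI command and return its stdout as text."""
    return subprocess.check_output(['yarn'] + args, universal_newlines=True)


def kill_overdue_apps(timeout_minutes, run=run_yarn, now_ms=None):
    """Kill RUNNING applications older than timeout_minutes; return killed ids."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    listing = run(['application', '--list', '--appStates', 'RUNNING'])
    app_ids = re.findall(r'^[ \t]*(application_\d+_\d+)\b', listing, re.MULTILINE)
    killed = []
    for app_id in app_ids:
        status = run(['application', '--status', app_id])
        match = re.search(r'Start-Time\s*:\s*(\d+)', status)
        if match is None:
            continue  # no start time reported, so leave this application alone
        elapsed_minutes = (now_ms - int(match.group(1))) / 60000
        if elapsed_minutes > timeout_minutes:
            run(['application', '--kill', app_id])
            killed.append(app_id)
    return killed
```

Calling `kill_overdue_apps(180)` on the master node would enforce the 3-hour limit from the question. The `run` parameter is there so the logic can be tried out with a fake command runner before pointing it at a real cluster.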
Automating the approach
AWS EMR has a wonderful feature called "bootstrap actions". It runs a set of actions on EMR cluster creation and can be utilized for automating this approach.
Add a bash script to bootstrap actions which is going to:
That should be it.
P.S. I assumed Python 3 is available on the cluster for this purpose.