I'm trying to execute spark-submit using the boto3 client for EMR. After executing the code below, the EMR step is submitted and fails after a few seconds. The actual command line from the step logs works when executed manually on the EMR master.
The controller log shows barely readable garbage that looks like several processes writing to it concurrently.
UPD: Tried command-runner.jar and EMR versions 4.0.0 and 4.1.0.
Any ideas appreciated.
The code fragment:
    import boto3

    class ProblemExample:
        def run(self):
            # cluster_id is assumed to be defined elsewhere (the EMR cluster / job flow ID)
            session = boto3.Session(profile_name='emr-profile')
            client = session.client('emr')
            response = client.add_job_flow_steps(
                JobFlowId=cluster_id,
                Steps=[
                    {
                        'Name': 'string',
                        'ActionOnFailure': 'CONTINUE',
                        'HadoopJarStep': {
                            'Jar': 's3n://elasticmapreduce/libs/script-runner/script-runner.jar',
                            'Args': [
                                '/usr/bin/spark-submit',
                                '--verbose',
                                '--class',
                                'my.spark.job',
                                '--jars', '<dependencies>',
                                '<my spark job>.jar'
                            ]
                        }
                    },
                ]
            )
The problem was finally resolved by escaping the --jars value properly.
spark-submit was failing because it could not find the classes, but the error was hard to spot against the background of the messy logs.
A valid example is:
    import boto3

    class Example:
        def run(self):
            # cluster_id is assumed to be defined elsewhere (the EMR cluster / job flow ID)
            session = boto3.Session(profile_name='emr-profile')
            client = session.client('emr')
            response = client.add_job_flow_steps(
                JobFlowId=cluster_id,
                Steps=[
                    {
                        'Name': 'string',
                        'ActionOnFailure': 'CONTINUE',
                        'HadoopJarStep': {
                            'Jar': 'command-runner.jar',
                            'Args': [
                                '/usr/bin/spark-submit',
                                '--verbose',
                                '--class',
                                'my.spark.job',
                                # the whole --jars value is wrapped in literal single quotes
                                '--jars', '\'<comma, separated, dependencies>\'',
                                '<my spark job>.jar'
                            ]
                        }
                    },
                ]
            )
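To avoid hand-writing the quoted --jars value each time, the step definition can be built by a small helper. This is only a sketch of the idea from the example above, not something from boto3 itself: the function name build_spark_submit_step, the step name and the S3 paths are hypothetical, and it assumes that wrapping the comma-joined list in literal single quotes is what your EMR release needs, as it was in my case.

```python
def build_spark_submit_step(name, main_class, app_jar, dependency_jars=()):
    """Build one entry for the Steps list passed to add_job_flow_steps.

    Hypothetical helper: mirrors the working example above, including the
    literal single quotes around the --jars value.
    """
    jars = list(dependency_jars)
    args = ['/usr/bin/spark-submit', '--verbose', '--class', main_class]
    if jars:
        # spark-submit expects --jars as one comma-separated value, no spaces
        args += ['--jars', "'{}'".format(','.join(jars))]
    args.append(app_jar)
    return {
        'Name': name,
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {'Jar': 'command-runner.jar', 'Args': args},
    }


# Example with made-up paths; pass the result as Steps=[step]
step = build_spark_submit_step(
    name='my spark step',
    main_class='my.spark.job',
    app_jar='s3://my-bucket/jobs/my-spark-job.jar',
    dependency_jars=['s3://my-bucket/libs/a.jar', 's3://my-bucket/libs/b.jar'],
)
print(step['HadoopJarStep']['Args'])
```

The resulting dict can be passed as client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step]). If there are no dependencies, the --jars argument is left out entirely.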