I'm trying to migrate a couple of MR jobs that I have written in Python from AWS EMR 2.4 to AWS EMR 5.0. Until now I was using boto 2.4, but it doesn't support EMR 5.0, so I'm trying to switch to boto3. While using boto 2.4, I used the StreamingStep module to specify the input and output locations as well as the locations of my mapper and reducer source files. With this module I effectively didn't have to create or upload any jar to run my jobs. However, I cannot find an equivalent for this module anywhere in the boto3 documentation. How can I add a streaming step to my MR job in boto3, so that I don't have to upload a jar file to run it?
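For context, this is roughly what the boto 2.x approach looks like; the bucket and file names are placeholders and the exact arguments may differ from my real jobs:

from boto.emr.connection import EmrConnection
from boto.emr.step import StreamingStep

conn = EmrConnection()  # picks up credentials from the environment/config

# A streaming step only needs S3 paths -- no jar to build or upload.
step = StreamingStep(
    name='My word count example',
    mapper='s3://mybucket/wordSplitter.py',  # placeholder paths
    reducer='aggregate',
    input='s3://mybucket/input/',
    output='s3://mybucket/output/',
)

jobid = conn.run_jobflow(name='myjob', steps=[step])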
It's unfortunate that boto3 and the EMR API are rather poorly documented. Minimally, the word-counting example would look as follows:
import boto3

emr = boto3.client('emr')

resp = emr.run_job_flow(
    Name='myjob',
    ReleaseLabel='emr-5.0.0',
    Instances={
        'InstanceGroups': [
            {'Name': 'master',
             'InstanceRole': 'MASTER',
             'InstanceType': 'c1.medium',
             'InstanceCount': 1,
             'Configurations': [
                 {'Classification': 'yarn-site',
                  'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
            {'Name': 'core',
             'InstanceRole': 'CORE',
             'InstanceType': 'c1.medium',
             'InstanceCount': 1,
             'Configurations': [
                 {'Classification': 'yarn-site',
                  'Properties': {'yarn.nodemanager.vmem-check-enabled': 'false'}}]},
        ]},
    Steps=[
        {'Name': 'My word count example',
         'HadoopJarStep': {
             'Jar': 'command-runner.jar',
             'Args': [
                 'hadoop-streaming',
                 '-files', 's3://mybucket/wordSplitter.py#wordSplitter.py',
                 '-mapper', 'python2.7 wordSplitter.py',
                 '-input', 's3://mybucket/input/',
                 '-output', 's3://mybucket/output/',
                 '-reducer', 'aggregate']}
         }
    ],
    JobFlowRole='EMR_EC2_DefaultRole',
    ServiceRole='EMR_DefaultRole',
)
I don't remember needing to do this with boto, but I have had issues running the simple streaming job properly without disabling vmem-check-enabled.

Also, if your script is located somewhere on S3, download it using -files (appending #filename to the argument makes the downloaded file available as filename in the cluster).
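If you already have a running cluster, the same streaming step can be submitted with add_job_flow_steps instead of run_job_flow. A minimal sketch, where the cluster ID is a placeholder:

import boto3

emr = boto3.client('emr')

# Submit the same command-runner.jar / hadoop-streaming step to an existing cluster.
resp = emr.add_job_flow_steps(
    JobFlowId='j-XXXXXXXXXXXXX',  # placeholder: your cluster ID
    Steps=[
        {'Name': 'My word count example',
         'ActionOnFailure': 'CONTINUE',
         'HadoopJarStep': {
             'Jar': 'command-runner.jar',
             'Args': [
                 'hadoop-streaming',
                 '-files', 's3://mybucket/wordSplitter.py#wordSplitter.py',
                 '-mapper', 'python2.7 wordSplitter.py',
                 '-input', 's3://mybucket/input/',
                 '-output', 's3://mybucket/output/',
                 '-reducer', 'aggregate']}
         }
    ],
)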