I am working with an AWS Data Pipeline that has a ShellCommandActivity whose script URI points to a bash file located in an S3 bucket. The bash file copies a Python script from the same S3 bucket to an EmrCluster and then tries to execute that Python script.
This is my pipeline export:
{
    "objects": [
        {
            "name": "DefaultResource1",
            "id": "ResourceId_27dLM",
            "amiVersion": "3.9.0",
            "type": "EmrCluster",
            "region": "us-east-1"
        },
        {
            "failureAndRerunMode": "CASCADE",
            "resourceRole": "DataPipelineDefaultResourceRole",
            "role": "DataPipelineDefaultRole",
            "pipelineLogUri": "s3://project/bin/scripts/logs/",
            "scheduleType": "ONDEMAND",
            "name": "Default",
            "id": "Default"
        },
        {
            "stage": "true",
            "scriptUri": "s3://project/bin/scripts/RunPython.sh",
            "name": "DefaultShellCommandActivity1",
            "id": "ShellCommandActivityId_hA57k",
            "runsOn": {
                "ref": "ResourceId_27dLM"
            },
            "type": "ShellCommandActivity"
        }
    ],
    "parameters": []
}
This is RunPython.sh:
#!/usr/bin/env bash
aws s3 cp s3://project/bin/scripts/Test.py ./
python ./Test.py
This is Test.py:
__author__ = 'MrRobot'
import re
import os
import sys
import boto3
print "We've entered the python file"
From the Stdout Log I get:
download: s3://project/bin/scripts/Test.py to ./
From the Stderr Log I get:
python: can't open file 'Test.py': [Errno 2] No such file or directory
I have also tried replacing python ./Test.py with python Test.py, but I get the same result.
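For what it's worth, a variant that pins the paths explicitly (just a sketch; I have not verified whether it behaves any differently) would be:

#!/usr/bin/env bash
# Copy the script to a fixed location and run it via an absolute path,
# so the result does not depend on the task runner's working directory
aws s3 cp s3://project/bin/scripts/Test.py /tmp/Test.py
python /tmp/Test.py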
How do I get my AWS Data Pipeline to execute my Test.py script?
EDIT
When I set scriptUri to s3://project/bin/scripts/Test.py I get the following errors:
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 1: author: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 2: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 3: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 4: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 5: import: command not found
/mnt/taskRunner/output/tmp/df-0947490M9EHH2Y32694-59ed8ca814264f5d9e65b2d52ce78a53/ShellCommandActivityIdJiZP720170209T175934Attempt1_command.sh: line 7: print: command not found
EDIT 2
Added the following line to Test.py:
#!/usr/bin/env python
Then I received the following error:
error: line 6, in import boto3
ImportError: No module named boto3
Using @franklinsijo's advice, I created a Bootstrap Action on the EmrCluster with the following value:
s3://project/bin/scripts/BootstrapActions.sh
This is BootstrapActions.sh:
#!/usr/bin/env bash
sudo pip install boto3
This worked!!!!!!!
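For reference, here is roughly how the bootstrap action ends up in the pipeline definition (a sketch; I added it through the console, and bootstrapAction is the EmrCluster field that takes the S3 path of the script):

{
    "name": "DefaultResource1",
    "id": "ResourceId_27dLM",
    "amiVersion": "3.9.0",
    "type": "EmrCluster",
    "region": "us-east-1",
    "bootstrapAction": "s3://project/bin/scripts/BootstrapActions.sh"
}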
You can also invoke the AWS Data Pipeline activation API directly from the AWS CLI and SDK. To get started, create a new pipeline and, in the default object, set the property "scheduleType": "ondemand". Setting this parameter enables on-demand activation of the pipeline.
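For example, such a pipeline can then be activated from the CLI (using the pipeline ID visible in the logs above):

aws datapipeline activate-pipeline --pipeline-id df-0947490M9EHH2Y32694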
Configure the ShellCommandActivity with the Script Uri pointing to the Python script and add #!/usr/bin/env python in the script.
If runsOn is chosen, add the installation commands as a bootstrap action for the EMR resource.
If workerGroup is chosen, install all the libraries on the worker group before pipeline activation.
Use either pip or easy_install to install the Python modules.
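Concretely (a sketch based on the pipeline export above, not a verified configuration), the activity would point scriptUri straight at the Python script:

{
    "stage": "true",
    "scriptUri": "s3://project/bin/scripts/Test.py",
    "name": "DefaultShellCommandActivity1",
    "id": "ShellCommandActivityId_hA57k",
    "runsOn": {
        "ref": "ResourceId_27dLM"
    },
    "type": "ShellCommandActivity"
}

with Test.py beginning:

#!/usr/bin/env python
import boto3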