How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server in the same security group, VPC, and subnet.
I need a solution that lets Airflow talk to EMR and run spark-submit.
https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/
These blogs cover execution once the connection has already been established, so they didn't help much.
In Airflow, I have created connections for AWS and EMR through the UI.
Below is the code that lists the EMR clusters, both active and terminated. I can also fine-tune it to return only active clusters:
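from airflow.contrib.hooks.aws_hook import AwsHook
import boto3

# Build a boto3 EMR client from the Airflow AWS connection
hook = AwsHook(aws_conn_id='aws_default')
client = hook.get_client_type('emr', 'eu-central-1')

# list_clusters() returns clusters in every state (active and terminated)
a = client.list_clusters()['Clusters']
for x in a:
    print(x['Status']['State'], x['Name'])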
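The "active clusters only" variant mentioned above could look roughly like the sketch below. It assumes `client` is the boto3 EMR client obtained from `hook.get_client_type('emr', 'eu-central-1')`; the helper name `list_active_clusters` and the choice of which states count as active are illustrative assumptions, not part of the original code.

```python
# States that EMR reports for clusters that have not terminated.
# Treating exactly these four as "active" is an assumption for this sketch.
ACTIVE_STATES = ["STARTING", "BOOTSTRAPPING", "RUNNING", "WAITING"]


def list_active_clusters(client):
    """Return (state, name, id) tuples for active clusters.

    `client` is expected to behave like a boto3 EMR client. Results are
    paginated, so the loop follows the `Marker` token until it runs out.
    """
    clusters = []
    kwargs = {"ClusterStates": ACTIVE_STATES}
    while True:
        page = client.list_clusters(**kwargs)
        for c in page.get("Clusters", []):
            clusters.append((c["Status"]["State"], c["Name"], c["Id"]))
        marker = page.get("Marker")
        if not marker:
            return clusters
        kwargs["Marker"] = marker
```

The cluster id returned here (the `j-...` value) is what the EMR step APIs expect when a job is submitted later.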
My question is: how can I update the code above so that it can perform spark-submit actions?
To submit a Spark step using the console: open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/. In the Cluster List, choose the name of your cluster. Scroll to the Steps section and expand it, then choose Add step.
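The same "Add step" action can be done from code with the boto3 EMR client, which is one way to extend the cluster-listing snippet in the question. The following is a minimal sketch, assuming `client` comes from `hook.get_client_type('emr', 'eu-central-1')`; the helper names, the S3 path, and the cluster id in the usage comment are hypothetical placeholders.

```python
def spark_step(name, app, app_args=(), deploy_mode="cluster"):
    """Build an EMR step definition that runs spark-submit via command-runner.jar."""
    return {
        "Name": name,
        # CONTINUE keeps the cluster alive if the step fails (a choice made for this sketch)
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", deploy_mode, app, *app_args],
        },
    }


def submit_spark_step(client, cluster_id, step):
    """Add one step to a running cluster and return the new step id."""
    response = client.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
    return response["StepIds"][0]


# Hypothetical usage with the client from the question:
# step_id = submit_spark_step(
#     client, "j-XXXXXXXXXXXXX", spark_step("spark-pi", "s3://my-bucket/jobs/pi.py")
# )
```

The call only queues the step; its progress can then be checked with `client.describe_step(ClusterId=..., StepId=...)`. The same step dictionary shape is what `EmrAddStepsOperator` takes in its `steps` argument.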
Upload the DAG to the Airflow S3 bucket's dags directory. Substitute your Airflow S3 bucket name in the AWS CLI command below, then run it from the project's root. The DAG, spark_pi_example, should automatically appear in the Airflow UI. Click on 'Trigger DAG' to create a new EMR cluster and start the Spark job.
While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit on (remote) EMR via Airflow:

1. Use Apache Livy
   - [...] EMR
   - Livy is in early stages and its API appears incomplete and wonky to me

2. Use EmrSteps API
   - [...] EMR
   - You need an EmrStepSensor (alongside EmrAddStepsOperator)
   - On an EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)

3. Use SSHHook / SSHOperator
   - If the spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome

EDIT-1
There seems to be another straightforward way:

4. Specifying remote master-IP
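For the SSHHook / SSHOperator route, the cumbersome part noted above is assembling the spark-submit command. One way to keep that manageable is a small builder function; the sketch below is an illustration only, and the function name, its keyword-to-flag convention, and the example paths are all assumptions.

```python
import shlex


def build_spark_submit(app, app_args=(), master="yarn", deploy_mode="cluster",
                       conf=None, **options):
    """Assemble a shell-safe spark-submit command line.

    `conf` maps Spark properties to values (emitted as --conf key=value).
    Extra keyword arguments become flags, e.g. num_executors=4 -> --num-executors 4.
    """
    parts = ["spark-submit", "--master", master, "--deploy-mode", deploy_mode]
    for key, value in sorted((conf or {}).items()):
        parts += ["--conf", f"{key}={value}"]
    for name, value in sorted(options.items()):
        parts += ["--" + name.replace("_", "-"), str(value)]
    parts.append(app)
    parts += [str(arg) for arg in app_args]
    # Quote each token so spaces or shell metacharacters survive the remote shell
    return " ".join(shlex.quote(part) for part in parts)


command = build_spark_submit(
    "s3://my-bucket/job.py",              # hypothetical application path
    ["--date", "2020-01-01"],
    conf={"spark.executor.memory": "4g"},
    num_executors=4,
)
```

The resulting string could be passed as the `command` argument of an `SSHOperator` whose `ssh_conn_id` points at the EMR master node, assuming such an SSH connection has been defined in Airflow.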
Useful links