How to submit Spark jobs to EMR cluster from Airflow?

How can I establish a connection between an EMR master cluster (created by Terraform) and Airflow? I have Airflow set up on an AWS EC2 server in the same security group, VPC, and subnet.

I need a solution so that Airflow can talk to EMR and execute spark-submit.

https://aws.amazon.com/blogs/big-data/build-a-concurrent-data-orchestration-pipeline-using-amazon-emr-and-apache-livy/

This blog explains execution once the connection has been established (it didn't help much).

In Airflow, I have created connections for AWS and EMR using the UI.


Below is the code, which lists the EMR clusters that are active and terminated; I can also fine-tune it to get only active clusters:

    from airflow.contrib.hooks.aws_hook import AwsHook

    hook = AwsHook(aws_conn_id='aws_default')
    client = hook.get_client_type('emr', 'eu-central-1')

    # List clusters (active and terminated) and print each one's state and name
    response = client.list_clusters()
    for cluster in response['Clusters']:
        print(cluster['Status']['State'], cluster['Name'])

My question is: how can I update the above code to perform spark-submit actions?

asked Jan 03 '19 by asur

People also ask

How do I submit Spark jobs to an EMR cluster?

To submit a Spark step using the console: open the Amazon EMR console at https://console.aws.amazon.com/elasticmapreduce/, choose the name of your cluster in the Cluster List, scroll to the Steps section and expand it, then choose Add step.
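The console steps above can also be done programmatically. Here is a minimal boto3 sketch reusing the hook from the question; the cluster ID and S3 script path are placeholders, not values from the question.

    from airflow.contrib.hooks.aws_hook import AwsHook

    hook = AwsHook(aws_conn_id='aws_default')
    client = hook.get_client_type('emr', 'eu-central-1')

    # Equivalent of the console's "Add step": submit a spark-submit step to EMR
    response = client.add_job_flow_steps(
        JobFlowId='j-XXXXXXXXXXXXX',  # placeholder cluster ID
        Steps=[{
            'Name': 'spark_submit_step',
            'ActionOnFailure': 'CONTINUE',
            'HadoopJarStep': {
                'Jar': 'command-runner.jar',  # EMR's built-in command runner
                'Args': ['spark-submit', '--deploy-mode', 'cluster',
                         's3://my-bucket/jobs/job.py'],  # hypothetical script
            },
        }],
    )
    print(response['StepIds'])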

How do you create an EMR cluster with Airflow?

Upload the DAG to the Airflow S3 bucket's dags directory. Substitute your Airflow S3 bucket name in the AWS CLI command below, then run it from the project's root. The DAG, spark_pi_example , should automatically appear in the Airflow UI. Click on 'Trigger DAG' to create a new EMR cluster and start the Spark job.
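For illustration, here is a minimal DAG sketch using the Airflow 1.x contrib operator (matching the imports in the question); the JOB_FLOW_OVERRIDES values are assumptions, not taken from the question.

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.emr_create_job_flow_operator import (
        EmrCreateJobFlowOperator,
    )

    # Assumed cluster spec; adjust release label and instance types as needed
    JOB_FLOW_OVERRIDES = {
        'Name': 'airflow-emr-cluster',
        'ReleaseLabel': 'emr-5.20.0',
        'Instances': {
            'InstanceGroups': [{
                'Name': 'Master node',
                'InstanceRole': 'MASTER',
                'InstanceType': 'm4.large',
                'InstanceCount': 1,
            }],
            'KeepJobFlowAliveWhenNoSteps': True,
        },
    }

    with DAG('create_emr_cluster', start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        create_cluster = EmrCreateJobFlowOperator(
            task_id='create_emr_cluster',
            aws_conn_id='aws_default',   # AWS credentials connection
            emr_conn_id='emr_default',   # EMR defaults connection
            job_flow_overrides=JOB_FLOW_OVERRIDES,
        )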




1 Answer

While it may not directly address your particular query, broadly, here are some ways you can trigger spark-submit on (remote) EMR via Airflow (a minimal sketch of each approach follows the list):

  1. Use Apache Livy

    • This solution is actually independent of the remote system, i.e., it need not be EMR
    • Here's an example
    • The downside is that Livy is in its early stages, and its API appears incomplete and wonky to me
  2. Use EmrSteps API

    • Dependent on remote system: EMR
    • Robust, but since it is inherently async, you will also need an EmrStepSensor (alongside EmrAddStepsOperator)
    • On a single EMR cluster, you cannot have more than one step running simultaneously (although some hacky workarounds exist)
  3. Use SSHHook / SSHOperator

    • Again independent of the remote system
    • Comparatively easier to get started with
    • If your spark-submit command involves a lot of arguments, building that command (programmatically) can become cumbersome
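For option 1, here is a minimal sketch of submitting a batch to Livy with plain requests, assuming Livy runs on the EMR master on its default port 8998; the host and script path are placeholders.

    import json
    import requests

    LIVY_URL = 'http://<emr-master-ip>:8998'  # placeholder master address

    # POST /batches makes Livy run spark-submit on the cluster for us
    resp = requests.post(
        LIVY_URL + '/batches',
        data=json.dumps({'file': 's3://my-bucket/jobs/job.py'}),  # hypothetical
        headers={'Content-Type': 'application/json'},
    )
    batch_id = resp.json()['id']

    # Poll the batch state until it reaches 'success' or 'dead'
    state = requests.get('{}/batches/{}/state'.format(LIVY_URL, batch_id))
    print(state.json()['state'])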
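For option 2, a minimal sketch with the Airflow 1.x contrib operator and sensor; the cluster ID and script path are placeholders.

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.emr_add_steps_operator import EmrAddStepsOperator
    from airflow.contrib.sensors.emr_step_sensor import EmrStepSensor

    SPARK_STEPS = [{
        'Name': 'spark_submit_step',
        'ActionOnFailure': 'CONTINUE',
        'HadoopJarStep': {
            'Jar': 'command-runner.jar',
            'Args': ['spark-submit', '--deploy-mode', 'cluster',
                     's3://my-bucket/jobs/job.py'],  # hypothetical script
        },
    }]

    with DAG('emr_spark_submit', start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        add_step = EmrAddStepsOperator(
            task_id='add_step',
            job_flow_id='j-XXXXXXXXXXXXX',  # placeholder cluster ID
            aws_conn_id='aws_default',
            steps=SPARK_STEPS,
        )
        # The step is async, so block until EMR reports it has finished
        watch_step = EmrStepSensor(
            task_id='watch_step',
            job_flow_id='j-XXXXXXXXXXXXX',
            step_id="{{ task_instance.xcom_pull(task_ids='add_step', "
                    "key='return_value')[0] }}",
            aws_conn_id='aws_default',
        )
        add_step >> watch_step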
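For option 3, a minimal sketch assuming an SSH connection to the EMR master (hypothetical connection ID ssh_emr) has already been configured in the Airflow UI.

    from datetime import datetime
    from airflow import DAG
    from airflow.contrib.operators.ssh_operator import SSHOperator

    with DAG('emr_ssh_spark_submit', start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        # Runs spark-submit on the EMR master over SSH
        spark_submit = SSHOperator(
            task_id='spark_submit',
            ssh_conn_id='ssh_emr',  # hypothetical SSH connection to the master
            command='spark-submit --deploy-mode cluster '
                    's3://my-bucket/jobs/job.py',  # hypothetical script
        )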

EDIT-1

There seems to be another straightforward way (a minimal sketch follows the list):

  1. Specifying remote master-IP

    • Independent of remote system
    • Needs modifying Global Configurations / Environment Variables
    • See @cricket_007's answer for details
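As a sketch of this approach, assuming the EMR cluster's Hadoop/YARN config files have been copied to a hypothetical path on the Airflow host, so the local spark-submit acts as a client of the remote YARN resource manager:

    from datetime import datetime
    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    with DAG('remote_yarn_spark_submit', start_date=datetime(2019, 1, 1),
             schedule_interval=None) as dag:
        spark_submit = BashOperator(
            task_id='spark_submit',
            # Point the local Spark client at the remote EMR YARN using
            # copied config files (hypothetical path)
            bash_command=(
                'export HADOOP_CONF_DIR=/etc/hadoop/emr-conf && '
                'export YARN_CONF_DIR=/etc/hadoop/emr-conf && '
                'spark-submit --master yarn --deploy-mode cluster '
                's3://my-bucket/jobs/job.py'  # hypothetical script
            ),
        )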

Useful links

  • This one is from @Kaxil Naik himself: Is there a way to submit spark job on different server running master
  • Spark job submission using Airflow by submitting batch POST method on Livy and tracking job
  • Remote spark-submit to YARN running on EMR
answered Nov 15 '22 by y2k-shubham