
How to run spark-submit remotely?

I have Spark running in a cluster (remote).

How do I submit an application using spark-submit to the remote cluster in the following scenario:

  1. spark-submit is executed as a command via Camel

  2. the application runs in its own container.

From the following links:

https://github.com/mvillarrealb/docker-spark-cluster

https://github.com/big-data-europe/docker-spark

we can submit spark applications, but we have to copy the files and jars to the volumes.

How do I avoid this?

Is there any way?

asked Nov 28 '19 by Pkumar

1 Answer

The easiest way to do this is to run a Livy REST server on the Spark master node. It allows you to submit a job just by packaging it locally and calling the submit REST API. Livy now comes by default with many Spark cloud providers (AWS, Azure, Hortonworks). See the docs.
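
As a rough illustration, here is a minimal sketch of submitting a batch job through Livy's REST API from Python; the host name, jar location and class name are placeholders, and it assumes the jar is already reachable by the cluster (e.g. on HDFS) and that Livy listens on its default port 8998:

import time
import requests

LIVY_URL = "http://spark-master:8998"  # assumption: Livy's default port on the master host

payload = {
    "file": "hdfs:///jobs/my-spark-job.jar",  # placeholder jar location, visible to the cluster
    "className": "com.example.MySparkJob",    # placeholder main class
    "args": ["--date", "2019-11-28"],
}

# Submit the batch; Livy answers with an id for the new batch session
resp = requests.post(f"{LIVY_URL}/batches", json=payload)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll the batch state until Livy reports a terminal state
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}/state").json()["state"]
    if state in ("success", "dead", "killed"):
        break
    time.sleep(5)

print(f"Batch {batch_id} finished with state: {state}")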

I still believe submitting should be possible just by installing the same Spark drivers locally, but I gave up on this. Especially with YARN, I could not find the proper config and which ports to connect to.

Actually, this is also not a good ops setup, because your machine then needs to participate in the cluster's network or have specific ports open, and your local machine also starts participating in the Spark protocol.

Deploying the code to a temp location on the cluster and then using spark-submit there, or calling a well-defined Livy API endpoint, is a good way to go. A sketch of the first option follows.
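
This is only a sketch, assuming SSH access to the master node and the paramiko library; host, user, paths and class name are placeholders:

import paramiko

MASTER_HOST = "spark-master.example.com"  # placeholder master host
LOCAL_JAR = "target/my-spark-job.jar"     # placeholder: the locally packaged job
REMOTE_JAR = "/tmp/my-spark-job.jar"      # temp location on the cluster

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect(MASTER_HOST, username="spark")  # placeholder user

# Copy the packaged job to a temp location on the cluster
sftp = client.open_sftp()
sftp.put(LOCAL_JAR, REMOTE_JAR)
sftp.close()

# Run spark-submit on the master node itself, so the local machine
# never has to join the cluster network or speak the spark protocol
cmd = (
    "/opt/spark/bin/spark-submit "
    "--class com.example.MySparkJob "
    "--master spark://spark-master:7077 "
    + REMOTE_JAR
)
stdin, stdout, stderr = client.exec_command(cmd)
print(stdout.read().decode())
print(stderr.read().decode())
client.close()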

Update regarding a comment about a connection within a cluster:

Within a cluster of Spark machines with the proper drivers installed on each machine, one can submit jobs from any machine. Also, within a cluster, admins leave the ports open to all participating machines.

The spark-submit command has a master-url parameter. This URL must use the spark protocol:

./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  <application-jar>

Without DNS and YARN, a master URL looks like this: spark://192.168.1.1:7077 (spark protocol, IP of the master node/VM, port).
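
To show how such a master URL is used from application code, here is a minimal PySpark sketch, assuming a standalone master reachable at that example IP and a local pyspark matching the cluster's Spark version:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://192.168.1.1:7077")  # spark protocol, master ip, port
    .appName("Pi")                       # placeholder app name
    .getOrCreate()
)

# Trivial job just to verify the connection to the cluster
print(spark.sparkContext.parallelize(range(1000)).count())
spark.stop()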

I have made a similar setup with docker-compose. https://github.com/dre-hh/spark_playground

  • There are 3 types of nodes with self-documenting names: spark-master, spark-worker and spark-submit.
  • The app code is only deployed to the spark-submit node by the build . command. This is the only Docker image which is built locally. It inherits from the Spark image, hence it has exactly the same Spark drivers as the other nodes. Additionally, it copies all the project code from the git repo (including the job) into a particular folder on the node.
  • All the other nodes are built from official images on the Docker registry and are left unchanged (except for some configs).
  • Finally, spark-submit could be used from the spark-submit node. However, in this example I have just launched an interactive Jupyter notebook and connected from the app code itself.

NOTE: docker-compose automatically comes with DNS, so I don't have to reference the nodes by IP.

 # "spark-master" will automatically resolve to the ip of the master node because of docker-compose naming convention and dns rules
 pyspark.SparkContext(master="spark://spark-master:7077", appName="Pi")

https://github.com/dre-hh/spark_playground/blob/master/docker-compose.yml#L48

answered Oct 16 '22 by dre-hh