I have Spark running in a remote cluster.
How do I submit an application using spark-submit to the remote cluster in the following scenario:
spark-submit is executed as a command via Camel
the application runs in its own container.
From the following links:
https://github.com/mvillarrealb/docker-spark-cluster
https://github.com/big-data-europe/docker-spark
we can submit Spark applications, but we have to copy the files and jars to the volumes.
How do I avoid this?
Is there any way?
The easiest way to do this is to use a Livy REST server running on the Spark master node. This allows you to submit a job just by packaging it locally and calling a submit REST API. Livy now comes by default with many Spark cloud providers (AWS, Azure, Hortonworks). See the docs.
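For illustration, here is a minimal sketch of submitting through Livy's batch REST API from Python. The host, jar path, and class name are placeholders, and the jar must already sit somewhere the cluster can read (e.g. HDFS or a whitelisted directory on the Livy server):

import time
import requests

# A sketch assuming Livy runs on the master node on its default port 8998.
# The host, jar path and class name below are placeholders.
LIVY_URL = "http://spark-master:8998"

payload = {
    "file": "hdfs:///jobs/my-app.jar",   # application jar readable by the cluster
    "className": "com.example.MyApp",    # main class of the application
}

# Submit the job as a Livy batch (POST /batches) and remember its id
resp = requests.post(f"{LIVY_URL}/batches", json=payload)
resp.raise_for_status()
batch_id = resp.json()["id"]

# Poll the batch until it reaches a terminal state
while True:
    state = requests.get(f"{LIVY_URL}/batches/{batch_id}").json()["state"]
    print(f"batch {batch_id}: {state}")
    if state in ("success", "dead", "killed", "error"):
        break
    time.sleep(5)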
I still believe submitting should be possible just by installing the same Spark drivers locally. However, I gave up on this; especially when YARN is used, I could not find a proper configuration or figure out which ports to connect to.
This is also not a good ops setup anyway, because your machine then needs to participate in the cluster's network or have specific ports open, and your local machine also starts participating in the Spark protocol.
Deploying the code to a temporary location on the cluster and then using spark-submit, or calling a well-defined Livy API endpoint, is a good way to go.
Update regarding a comment about a connection within a cluster:
Within a cluster of Spark machines with the proper drivers installed on each machine, one can submit jobs from any machine. Also, within a cluster, admins leave the ports open to all participating machines.
The spark-submit command has a master-url parameter. This URL must use the spark protocol:
./bin/spark-submit \
--class <main-class> \
--master <master-url> \
<application-jar>
Without DNS and YARN, a master URL looks like this: spark://192.168.1.1:7077 (spark protocol, IP of the master node/VM, port).
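For illustration, a minimal PySpark sketch that connects to such an IP-based master URL (the IP and app name are placeholders, and the workers must be able to reach back to the machine running this driver):

from pyspark.sql import SparkSession

# Connect straight to the standalone master by IP (placeholder address)
spark = (
    SparkSession.builder
    .master("spark://192.168.1.1:7077")
    .appName("remote-submit-test")
    .getOrCreate()
)

# Run a trivial job to verify that executors were actually allocated
print(spark.sparkContext.parallelize(range(1000)).count())
spark.stop()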
I have made a similar setup with docker-compose. https://github.com/dre-hh/spark_playground
NOTE: docker-compose automatically comes with DNS, so I don't have to reference the nodes by IP.
# "spark-master" will automatically resolve to the ip of the master node because of docker-compose naming convention and dns rules
pyspark.SparkContext(master="spark://spark-master:7077", appName="Pi")
https://github.com/dre-hh/spark_playground/blob/master/docker-compose.yml#L48