How to run Spark on Docker?

2 Answers

This error sounds like the workers have not registered with the master.

This can be checked in the master's Spark web UI at http://<masterip>:8080
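If you would rather check from a terminal than a browser, here is a rough sketch. It assumes the master UI also serves its status as JSON under /json and that each worker entry carries a "state" field set to "ALIVE" when registered. That matches the standalone masters I have seen, but verify it against your Spark version. The function name count_alive_workers is my own.

```shell
#!/bin/bash
# Count worker entries reported as ALIVE in the master's JSON status.
# Assumption: each worker appears with a field like "state" : "ALIVE".
# Reads the JSON on stdin so it can be fed from curl or from a saved file.
count_alive_workers() {
  grep -o '"state"[[:space:]]*:[[:space:]]*"ALIVE"' | wc -l | tr -d '[:space:]'
}

# Example (not run automatically), with <masterip> replaced by your master:
#   curl -s http://<masterip>:8080/json/ | count_alive_workers
```

A result of 0 would be consistent with the workers never having registered.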

You could also simply use a different docker image, or compare docker images with one that works and see what is different.

I have dockerized a spark master and spark worker.

If you have a Linux machine sitting behind a NAT router, like a home firewall, that allocates addresses in the private 192.168.1.* network to the machines, the script below will download a Spark 1.3.1 master and a worker and run them in separate Docker containers with addresses 192.168.1.10 and 192.168.1.11 respectively. You may need to tweak the addresses if 192.168.1.10 and 192.168.1.11 are already used on your LAN.

pipework is a utility for bridging the LAN to the container instead of using the internal docker bridge.

Spark requires all of the machines to be able to communicate with each other. As far as I can tell, Spark is not hierarchical: I've seen the workers try to open ports to each other. So in the shell script I expose all the ports, which is OK if the machines are otherwise firewalled, such as behind a home NAT router.

./run-docker-spark

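#!/bin/bash
# Refresh sudo credentials up front; pipework is run through sudo below.
sudo -v
# Start the master; the --add-host entries give each container the same name-to-address map.
MASTER=$(docker run --name="master" -h master --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env SPARK_MASTER_IP=192.168.1.10 -d drpaulbrewer/spark-master:latest)
# Attach the master to the LAN. pipework takes the address as <ip>/<prefix-length>@<gateway>; substitute your LAN's values for the placeholders.
sudo pipework eth0 "$MASTER" 192.168.1.10/<prefix-length>@<gateway-ip>
# Start one worker pointed at the master, with /data and /tmp mounted from the host.
SPARK1=$(docker run --name="spark1" -h spark1 --add-host home:192.168.1.8 --add-host master:192.168.1.10 --add-host spark1:192.168.1.11 --add-host spark2:192.168.1.12 --add-host spark3:192.168.1.13 --add-host spark4:192.168.1.14 --expose=1-65535 --env mem=10G --env master=spark://192.168.1.10:7077 -v /data:/data -v /tmp:/tmp -d drpaulbrewer/spark-worker:latest)
sudo pipework eth0 "$SPARK1" 192.168.1.11/<prefix-length>@<gateway-ip>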
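The script repeats the same --add-host entries on every docker run line, which gets error-prone as workers are added. One possible tidy-up, only a sketch, is to generate those flags from a single list. The helper name add_host_flags is hypothetical and not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper: turn "name:ip" arguments into docker --add-host flags.
add_host_flags() {
  local flags="" entry
  for entry in "$@"; do
    flags="${flags}--add-host ${entry} "
  done
  # Strip the trailing space before printing.
  printf '%s\n' "${flags% }"
}
```

It would be used unquoted so the shell splits the flags into separate words, for example `docker run $(add_host_flags master:192.168.1.10 spark1:192.168.1.11) ...` with the rest of the options unchanged.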

After running this script I can see the master web report at 192.168.1.10:8080, or go to another machine on my LAN that has a spark distribution, and run ./spark-shell --master spark://192.168.1.10:7077 and it will bring up an interactive scala shell.


answered Oct 18 '22 12:10

Paul

The second cause is the more common one in the Docker case. You should check that you:

Expose all necessary ports
Set correct spark.broadcast.factory
Handle docker aliases

Unless all three issues are handled, the parts of the Spark cluster (master, worker, driver) can't communicate. Each issue is covered in detail at http://sometechshit.blogspot.ru/2015/04/running-spark-standalone-cluster-in.html, or you can use a ready-made Spark container from https://registry.hub.docker.com/u/epahomov/docker-spark/
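For the port part of that list, one approach is to pin Spark's normally random ports to fixed values so that matching ports can be exposed on the containers. This sketch is not taken from the linked post. It covers only spark.driver.port and spark.blockManager.port, the port numbers are arbitrary assumptions, and the function name fixed_port_conf is made up for illustration. The broadcast factory setting is not covered here.

```shell
#!/bin/bash
# Print spark-defaults.conf lines that pin two Spark ports to fixed values.
# Defaults 7001 and 7005 are illustrative assumptions; pick ports free on your hosts.
fixed_port_conf() {
  local driver_port="${1:-7001}"
  local block_manager_port="${2:-7005}"
  printf 'spark.driver.port %s\n' "$driver_port"
  printf 'spark.blockManager.port %s\n' "$block_manager_port"
}
```

The output could be appended to the configuration with something like `fixed_port_conf >> conf/spark-defaults.conf`, and the same ports then exposed when starting the containers.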

If the problem is resources, try requesting fewer of them (number of executors, memory, cores) with the flags described at https://spark.apache.org/docs/latest/configuration.html. You can see how many resources are available on the Spark master UI page, which is http://localhost:8080 by default.
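As a rough illustration of asking for less, the sketch below builds a spark-shell command line with small requests using --total-executor-cores and --executor-memory. The defaults of 1 core and 512m are assumptions chosen only to be small, and small_spark_shell_cmd is a made-up name. It prints the command rather than running it.

```shell
#!/bin/bash
# Print a spark-shell command that requests deliberately few resources.
# Arguments: master URL, total cores (default 1), executor memory (default 512m).
small_spark_shell_cmd() {
  local master_url="$1"
  local cores="${2:-1}"
  local mem="${3:-512m}"
  printf './spark-shell --master %s --total-executor-cores %s --executor-memory %s\n' \
    "$master_url" "$cores" "$mem"
}
```

If the job is accepted with the small values, raise them step by step until they match what the master UI reports as available.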

answered Oct 18 '22 13:10

epahomov


Tags:

docker

apache-spark

Nina Bardashova
