Currently I am running my program as:

    val conf = new SparkConf()
      .setAppName("Test Data Analysis")
      .setMaster("local[*]")
      .set("spark.executor.memory", "32g")
      .set("spark.driver.memory", "32g")
      .set("spark.driver.maxResultSize", "4g")
Even though I am running on a cluster of 5 machines (each with 376 GB of physical RAM), my program errors out with java.lang.OutOfMemoryError: Java heap space.
My data is big, but not so big that it exceeds 32 GB of executor memory * 5 nodes.
I suspect it may be because I am using "local" as my master. I have seen documentation say to use spark://machinename:7077.
However, for my cluster, how do I determine this URL and port?
EDIT: I can see that the documentation talks about running a script called start-master.sh in order to make a node the master.
In my case the Spark cluster was set up and is maintained by someone else, so I don't want to change the topology by starting my own master.
How can I query and find out which node is the existing master?
I already tried picking a random node in the cluster and trying 'spark://node:7077', but this does not work and gives the error:

    15/11/03 20:06:21 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@node:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@node:7077]
You can find this URL on the master's web UI, which is http://localhost:8080 by default. Once you have started a worker, look at that same web UI; you should see the new node listed there, along with its number of CPUs and its memory (minus one gigabyte left for the OS).
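Once you have that URL, point your SparkConf at it instead of local[*]. A minimal sketch, assuming the UI reports spark://master:7077 (the hostname "master" is a placeholder for whatever host the UI actually shows):

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: connect to an existing standalone master.
    // "master" is a placeholder; copy the exact URL from the web UI.
    val conf = new SparkConf()
      .setAppName("Test Data Analysis")
      .setMaster("spark://master:7077")     // from the top of the master web UI
      .set("spark.executor.memory", "32g")  // per-executor heap, not a cluster total
    val sc = new SparkContext(conf)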
In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
The Spark Master is the process that requests resources in the cluster and makes them available to the Spark driver. In all deployment modes, the Master negotiates resources or containers with the worker nodes, tracks their status, and monitors their progress.
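On YARN, the choice between the two modes can also be expressed in configuration. A minimal sketch, assuming a recent Spark version where the master is simply "yarn" (older releases spelled it yarn-client / yarn-cluster), and noting that cluster mode is normally selected via spark-submit rather than in code:

    import org.apache.spark.SparkConf

    // Sketch: run on YARN with the driver in this process (client mode).
    // spark.submit.deployMode is a standard Spark property.
    val conf = new SparkConf()
      .setAppName("Test Data Analysis")
      .setMaster("yarn")
      .set("spark.submit.deployMode", "client")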
I found that passing --master yarn-cluster to spark-submit works best. This makes sure that Spark uses all the nodes of the Hadoop cluster.
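The code side is then deliberately minimal. A sketch, assuming the job is launched through spark-submit (the class and jar names below are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: when the master comes from spark-submit, e.g.
    //   spark-submit --master yarn-cluster --class example.Main app.jar
    // do NOT call setMaster here; a value set in code overrides the
    // command line. "example.Main" and "app.jar" are placeholders.
    val conf = new SparkConf().setAppName("Test Data Analysis")
    val sc = new SparkContext(conf)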
You are spot on: .setMaster("local[*]") runs Spark in local, self-contained mode. In this mode Spark can use only the resources of the local machine.
If you have already set up a Spark standalone cluster on top of your physical cluster, the solution is easy: check http://master:8080, where master points at the Spark master machine (the standalone master's web UI listens on port 8080 by default; port 8088 belongs to the YARN ResourceManager, which is a different thing). There you can see the Spark master URI, which by default is spark://master:7077. Quite a bit of other information lives there as well, if you have a Spark standalone cluster.
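As a quick sanity check, you can print which master your application actually bound to. A short sketch (the URI is again a placeholder copied from the web UI):

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: verify the effective master at runtime.
    val conf = new SparkConf()
      .setAppName("Test Data Analysis")
      .setMaster("spark://master:7077")  // URI copied from the master web UI
    val sc = new SparkContext(conf)
    println(sc.master)                   // prints the master URL in effect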
However, I see a lot of questions on SO claiming this does not work, for many different reasons. Using the spark-submit utility is simply less error prone; see its usage. But if you don't have a Spark cluster yet, I suggest setting up a Spark Standalone cluster first.