How to find the master URL for an existing spark cluster

Tags:

apache-spark

Currently I am running my program as:

val conf = new SparkConf()
  .setAppName("Test Data Analysis")
  .setMaster("local[*]")
  .set("spark.executor.memory", "32g")
  .set("spark.driver.memory", "32g")
  .set("spark.driver.maxResultSize", "4g")

Even though I am running on a cluster of 5 machines (each with 376 GB of physical RAM), my program errors out with java.lang.OutOfMemoryError: Java heap space.

My data sizes are big... but not so big that they exceed 32 GB of executor memory × 5 nodes.

I suspect it may be because I am using "local" as my master. I have seen documentation say to use spark://machinename:7077.

However, for my cluster, how do I determine this URL and port?
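For reference, this is roughly what I expect the configuration to look like once I know the master URL (the host name below is just a placeholder):

val conf = new SparkConf()
  .setAppName("Test Data Analysis")
  .setMaster("spark://master-host:7077") // placeholder: replace with the real master host
  .set("spark.executor.memory", "32g")
  .set("spark.driver.memory", "32g")
  .set("spark.driver.maxResultSize", "4g")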

EDIT: I can see that the documentation talks about running start-master.sh in order to make a node the master.

In my case the Spark cluster was set up and is maintained by someone else, so I don't want to change the topology by starting my own master.

How can I query and find out which node is the existing master?

I already tried picking a random node in the cluster and trying spark://node:7077, but this does not work and gives this error:

15/11/03 20:06:21 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@node:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@node:7077]
asked Nov 03 '15 by Knows Not Much


People also ask

Where is my Spark cluster URL?

You can find this URL on the master's web UI, which is http://localhost:8080 by default. Once you have started a worker, the master's web UI should list the new node, along with its number of CPUs and memory (minus one gigabyte left for the OS).
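If you prefer to script this, the standalone master's web UI also serves its status as JSON at /json. A minimal sketch, assuming the default UI port 8080 and a master host simply named master, that extracts the spark:// URL:

import scala.io.Source

// Fetch the standalone master's JSON status page (default web UI port 8080)
// and pull out the cluster URL, e.g. "spark://master:7077".
val json = Source.fromURL("http://master:8080/json").mkString
val masterUrl = """spark://[^"]+""".r.findFirstIn(json)
println(masterUrl.getOrElse("no spark:// URL found"))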

What is the master of the Spark application?

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
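As an illustration, the same application can be submitted in either mode; the class and jar names below are placeholders:

spark-submit --master yarn --deploy-mode cluster --class com.example.Analysis app.jar
spark-submit --master yarn --deploy-mode client --class com.example.Analysis app.jar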

What is master in Spark?

The Spark Master is the process that requests resources in the cluster and makes them available to the Spark driver. In all deployment modes, the Master negotiates resources or containers with the worker nodes, tracks their status, and monitors their progress.


2 Answers

I found that using --master yarn-cluster works best. This makes sure that Spark uses all the nodes of the Hadoop cluster.
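Note that on Spark 2.x and later, the yarn-cluster master value is deprecated in favor of --master yarn with an explicit deploy mode. A sketch of the equivalent spark-submit invocation, carrying over the memory settings from the question (class and jar names are placeholders):

spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 32g --driver-memory 32g \
  --conf spark.driver.maxResultSize=4g \
  --class com.example.Analysis app.jar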

answered Sep 28 '22 by Knows Not Much


You are spot on. .setMaster("local[*]") will run Spark in local mode. In this mode Spark can utilize only the resources of the local machine.

If you've already set up a Spark cluster on top of your physical cluster, the solution is an easy one: check http://master:8080, where master points to the Spark master machine and 8080 is the standalone master's default web UI port. There you can see the Spark master URI, which by default is spark://master:7077. Quite a bit of information lives there, in fact, if you have a Spark standalone cluster.

However, I see a lot of questions on SO claiming this does not work, for many different reasons. Using the spark-submit utility is simply less error prone; see its usage documentation.
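For a standalone cluster, once the master URL is known, the submission might look like this (host, class, and jar names are placeholders):

spark-submit --master spark://master:7077 \
  --executor-memory 32g --driver-memory 32g \
  --class com.example.Analysis app.jar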

But if you haven't got a Spark cluster yet, I suggest setting up a Spark Standalone cluster first.

answered Sep 28 '22 by mehmetminanc