How to find the master URL for an existing spark cluster

Tags:

apache-spark

Currently I am running my program as:

val conf = new SparkConf()
  .setAppName("Test Data Analysis")
  .setMaster("local[*]")
  .set("spark.executor.memory", "32g")
  .set("spark.driver.memory", "32g")
  .set("spark.driver.maxResultSize", "4g")

Even though I am running on a cluster of 5 machines (each with 376 GB of physical RAM), my program errors out with java.lang.OutOfMemoryError: Java heap space.

My data sizes are big... but not so big that they exceed 32 GB of executor memory × 5 nodes.

I suspect it may be because I am using "local" as my master. I have seen documentation say to use spark://machinename:7077.

However, for my cluster, how do I determine this URL and port?
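For reference, this is roughly what I expect the configuration to look like once I know the master URL (the host name below is just a placeholder):

val conf = new SparkConf()
  .setAppName("Test Data Analysis")
  .setMaster("spark://master-host:7077") // placeholder: replace with the real master host
  .set("spark.executor.memory", "32g")
  .set("spark.driver.memory", "32g")
  .set("spark.driver.maxResultSize", "4g")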

EDIT: I can see that the documentation talks about running start-master.sh in order to make a node the master.

In my case the Spark cluster was set up and is maintained by someone else, so I don't want to change the topology by starting my own master.

How can I query and find out which node is the existing master?

I already tried picking a random node in the cluster and trying spark://node:7077, but this does not work and gives this error:

15/11/03 20:06:21 WARN AppClient$ClientActor: Could not connect to akka.tcp://sparkMaster@node:7077: akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkMaster@node:7077]
asked Nov 03 '15 by Knows Not Much


People also ask

Where is my Spark cluster URL?

You can find this URL on the master's web UI, which is http://localhost:8080 by default. Once you have started a worker, the master's web UI should list the new node, along with its number of CPUs and memory (minus one gigabyte left for the OS).
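If you prefer to script this, the standalone master's web UI also serves its status as JSON at /json. A minimal sketch, assuming the default UI port 8080 and a master host simply named master, that extracts the spark:// URL:

import scala.io.Source

// Fetch the standalone master's JSON status page (default web UI port 8080)
// and pull out the cluster URL, e.g. "spark://master:7077".
val json = Source.fromURL("http://master:8080/json").mkString
val masterUrl = """spark://[^"]+""".r.findFirstIn(json)
println(masterUrl.getOrElse("no spark:// URL found"))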

What is the master of the Spark application?

In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
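As an illustration, the same application can be submitted in either mode; the class and jar names below are placeholders:

spark-submit --master yarn --deploy-mode cluster --class com.example.Analysis app.jar
spark-submit --master yarn --deploy-mode client --class com.example.Analysis app.jar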

What is master in Spark?

The Spark Master is the process that requests resources in the cluster and makes them available to the Spark driver. In all deployment modes, the Master negotiates resources or containers with the worker nodes, tracks their status, and monitors their progress.


2 Answers

I found that using --master yarn-cluster works best. This makes sure that Spark uses all the nodes of the Hadoop cluster.
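Note that on Spark 2.x and later, the yarn-cluster master value is deprecated in favor of --master yarn with an explicit deploy mode. A sketch of the equivalent spark-submit invocation, carrying over the memory settings from the question (class and jar names are placeholders):

spark-submit --master yarn --deploy-mode cluster \
  --executor-memory 32g --driver-memory 32g \
  --conf spark.driver.maxResultSize=4g \
  --class com.example.Analysis app.jar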

answered Sep 28 '22 by Knows Not Much


You are spot on. .setMaster("local[*]") will run Spark in local mode. In this mode Spark can utilize only the resources of the local machine.

If you've already set up a Spark cluster on top of your physical cluster, the solution is an easy one: check http://master:8080, where master points to the Spark master machine and 8080 is the standalone master's default web UI port. There you can see the Spark master URI, which by default is spark://master:7077. Quite a bit of information lives there, in fact, if you have a Spark standalone cluster.

However, I see a lot of questions on SO claiming this does not work, for many different reasons. Using the spark-submit utility is simply less error prone; see its usage documentation.
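For a standalone cluster, once the master URL is known, the submission might look like this (host, class, and jar names are placeholders):

spark-submit --master spark://master:7077 \
  --executor-memory 32g --driver-memory 32g \
  --class com.example.Analysis app.jar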

But if you haven't got a Spark cluster yet, I suggest setting up a Spark Standalone cluster first.

answered Sep 28 '22 by mehmetminanc