I am told there is a Spark cluster whose master is running at "spark://remote-host-num1:7077", with additional nodes at "remote-host-num2:7077" and "remote-host-num3:7077".
If I write a program that does the following:
SparkConf conf = new SparkConf().setAppName("org.sparkexample.TestCount").setMaster("spark://remote-host-num1:7077");
JavaSparkContext sc = new JavaSparkContext(conf);
and then create a JavaRDD "myrdd" from sc.textFile and perform an operation on it, such as getting its count with "myrdd.count()". Is this operation taking advantage of all the machines in the remote cluster?
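For reference, the whole program would look roughly like this (the input path is just a placeholder and would have to be reachable from the cluster):

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class TestCount {
    public static void main(String[] args) {
        // Point the driver at the remote standalone master.
        SparkConf conf = new SparkConf()
                .setAppName("org.sparkexample.TestCount")
                .setMaster("spark://remote-host-num1:7077");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // The input path is a placeholder; it has to be readable from every
        // worker node (e.g. HDFS or a shared filesystem), not just the driver.
        JavaRDD<String> myrdd = sc.textFile("hdfs:///path/to/input.txt");
        System.out.println("count = " + myrdd.count());

        sc.stop();
    }
}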
I want to make sure, because I would rather not use spark-submit "myjarfile" if I can avoid it. If I have to, what should I be doing? If I have to use spark-submit to take advantage of the distributed nature of Spark across multiple machines, is there a way to do this programmatically in Java?
Yes, support was added in Spark 1.4.x for submitting Scala/Java Spark applications as a child process. You can find the details in the Javadocs for the org.apache.spark.launcher package (in particular the SparkLauncher class). The link below is where it is referenced in the Spark documentation.
https://spark.apache.org/docs/latest/programming-guide.html#launching-spark-jobs-from-java--scala
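A minimal sketch of what that looks like, assuming the application jar is at /path/to/myjarfile.jar (a placeholder) and SPARK_HOME is set on the machine doing the launching:

import org.apache.spark.launcher.SparkLauncher;

public class LaunchTestCount {
    public static void main(String[] args) throws Exception {
        // Spawns spark-submit as a child process; SPARK_HOME must point at a
        // Spark installation (or call setSparkHome explicitly).
        Process spark = new SparkLauncher()
                .setAppResource("/path/to/myjarfile.jar")   // placeholder jar path
                .setMainClass("org.sparkexample.TestCount")
                .setMaster("spark://remote-host-num1:7077")
                .setConf(SparkLauncher.EXECUTOR_MEMORY, "2g")
                .launch();

        // Block until the submitted application finishes.
        int exitCode = spark.waitFor();
        System.out.println("spark-submit exited with code " + exitCode);
    }
}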
Question 1: Is this operation taking advantage of all the machines in the remote cluster?
Go to http://remote-host-num2:8080. That is the Spark master web UI, and it shows you the distributed state of your cluster: how many workers are registered, how many of them are currently alive, and so on.
You can even submit a job and check this page to see whether the job is delegated to all workers. *For an operation like count it will most likely be distributed: Spark splits the job into stages and tasks and hands them to the worker nodes to process.*
It also looks like there are two Spark masters in the cluster, hosted at remote-host-num2:7077 and remote-host-num3:7077. One of them is elected as the leader; cluster management is not shared between them. If the current leader goes down, the other becomes the leader.
Question 2: If I have to use spark-submit to take advantage of the distributed nature of Spark across multiple machines, is there a way to do this programmatically in Java?
You submit the job to the cluster. Since Spark works with RDDs, which are immutable by nature, operations on them can be easily parallelised. As I said earlier, submit the job and check the web UI to see whether it is being processed by all workers.
See the documentation of spark-submit for all the options. For example, the --executor-cores option lets you set the number of cores per executor for the job.
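For example, an invocation against this cluster might look like the following (the jar path and the resource numbers are placeholders):

spark-submit \
  --class org.sparkexample.TestCount \
  --master spark://remote-host-num1:7077 \
  --deploy-mode client \
  --executor-cores 2 \
  --executor-memory 2g \
  /path/to/myjarfile.jar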
Question 3: Is it possible to connect to a full-fledged Spark cluster without spark-submit?
Yes. In the main method of your Spark application, populate the Spark config completely (master URL, deploy mode, executor configuration, driver configuration, etc.) and simply run your class.
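A minimal sketch of that approach, assuming a standalone cluster and placeholder paths; everything spark-submit would normally pass is set in code and the class is run directly:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class RunWithoutSparkSubmit {
    public static void main(String[] args) {
        // Everything spark-submit would normally configure is set here.
        // Running the main method directly behaves like client deploy mode.
        SparkConf conf = new SparkConf()
                .setAppName("org.sparkexample.TestCount")
                .setMaster("spark://remote-host-num1:7077")
                .set("spark.executor.memory", "2g")
                .set("spark.executor.cores", "2")
                // The jar path is a placeholder; it must contain this class so
                // the worker nodes can load your code.
                .setJars(new String[] {"/path/to/myjarfile.jar"});

        JavaSparkContext sc = new JavaSparkContext(conf);
        long count = sc.textFile("hdfs:///path/to/input.txt").count();
        System.out.println("count = " + count);
        sc.stop();
    }
}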
Still, I would suggest you go with spark-submit. Spark can run under multiple cluster managers (a standalone Spark cluster, Mesos, and YARN currently). The whole point is to develop an application that contains your business logic alone, and then submit it to an environment of your choice.