I would like to develop a Scala application which connects to a master and runs a piece of Spark code. I would like to achieve this without using spark-submit. Is this possible? In particular, I would like to know whether the following code can run from my machine and connect to a cluster:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("Meisam")
  .setMaster("yarn-client")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.sql("SELECT * FROM myTable")
...
No, you don't need spark-submit for this. However, if you run on a cluster, you will need some form of shared file system (for example, NFS mounted at the same path on each node). If you have that kind of filesystem, you can simply deploy Spark in standalone mode.
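As a minimal sketch, assuming a standalone master is already running at a hypothetical spark://master-host:7077, the question's code only needs a different master URL:

import org.apache.spark.{SparkConf, SparkContext}

// Connect directly to a standalone master instead of YARN.
val conf = new SparkConf()
  .setAppName("Meisam")
  .setMaster("spark://master-host:7077") // placeholder host, default standalone port
val sc = new SparkContext(conf)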
A Spark program implicitly builds a logical directed acyclic graph (DAG) of operations. When the driver program runs, it converts this logical graph into a physical execution plan. Transformations are lazy; an action such as collect is what actually triggers execution, gathers the data, and returns a final result to the driver.
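For example, a small sketch using the sc from the question's code:

// filter and map are lazy transformations: they only extend the DAG,
// nothing runs on the cluster yet.
val numbers = sc.parallelize(1 to 1000)
val doubledEvens = numbers.filter(_ % 2 == 0).map(_ * 2)

// collect is an action: Spark turns the DAG into a physical plan,
// runs the job, and returns the results to the driver.
val result: Array[Int] = doubledEvens.collect()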
Using the --master option, you specify which cluster manager to use to run your application. Spark currently supports YARN, Mesos, Kubernetes, standalone, and local.
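Set programmatically, the corresponding master URLs look like this (host names and ports are placeholders):

// Pick exactly one of these when building the SparkConf:
conf.setMaster("local[*]")                    // run locally, one thread per core
conf.setMaster("spark://master-host:7077")    // Spark standalone
conf.setMaster("yarn")                        // YARN (cluster located via HADOOP_CONF_DIR)
conf.setMaster("mesos://master-host:5050")    // Mesos
conf.setMaster("k8s://https://api-host:6443") // Kubernetes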
Add a conf:
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("Meisam")
  .setMaster("yarn-client")
  // spark.driver.host: the address the executors use to connect back to the driver.
  .set("spark.driver.host", "127.0.0.1")
Yes, it's possible, and what you did is basically all that's needed to have tasks running on a YARN cluster in the client deploy mode (where the driver runs on the machine that launches the application).
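A sketch of the pieces that make the client-mode connection work (the jar path is a placeholder; the YARN-side requirement is that the client can find the cluster configuration):

import org.apache.spark.{SparkConf, SparkContext}

// Assumes HADOOP_CONF_DIR (or YARN_CONF_DIR) points at the cluster's
// configuration files, so the client can locate the ResourceManager.
val conf = new SparkConf()
  .setAppName("Meisam")
  .setMaster("yarn-client")
  // Ship the application jar so the YARN containers can load your classes.
  .setJars(Seq("/path/to/your-app.jar")) // placeholder path
val sc = new SparkContext(conf)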
spark-submit helps you keep your code free of the few SparkConf settings that are required for proper execution, such as the master URL. When you keep your code free of these low-level details, you can deploy your Spark application on any Spark cluster - YARN, Mesos, Spark Standalone, or local - without recompiling it.
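For example, a sketch of the same application with the master left out of the code (the class name, jar path, and master choice below are placeholders supplied at launch time):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// The cluster manager is chosen when the job is launched, e.g.:
//   spark-submit --master yarn --deploy-mode client \
//     --class com.example.MyApp target/my-app.jar
val conf = new SparkConf().setAppName("Meisam")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val df = sqlContext.sql("SELECT * FROM myTable")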
As opposed to what has been said here, I think it's only partially possible, as I recently discovered the hard way, being the Spark newbie that I am. While you can definitely connect to a cluster as noted above and run code on it, you may run into problems as soon as you do anything non-trivial, even something as simple as using UDFs (user-defined functions, i.e. anything not already included in Spark). Have a look at https://issues.apache.org/jira/browse/SPARK-18075 and the other related tickets, and most importantly at the responses. Also, this seems useful (I'm having a look at it now): Submitting spark app as a yarn job from Eclipse and Spark Context