Pyspark on yarn-cluster mode

Is there any way to run PySpark scripts in yarn-cluster mode without using the spark-submit script? I need it this way because I will integrate this code into a Django web app.

When I try to run any script in yarn-cluster mode, I get the following error:

org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

I'm creating the SparkContext in the following way:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-cluster")
            .setAppName("DataFrameTest"))

    sc = SparkContext(conf=conf)

    # DataFrame code ...

Thanks

asked Jul 09 '15 by jegordon



1 Answer

The reason yarn-cluster mode isn't supported here is that yarn-cluster means bootstrapping the driver program itself (i.e. the program that creates and uses the SparkContext) onto a YARN container. Guessing from your statement about submitting from a Django web app, it sounds like you want the Python code that contains the SparkContext to be embedded in the web app itself, rather than shipping the driver code onto a YARN container that then handles a separate Spark job.

This means your case most closely fits with yarn-client mode instead of yarn-cluster; in yarn-client mode, you can run your SparkContext code anywhere (like inside your web app), while it talks to YARN for the actual mechanics of running jobs.
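As a minimal sketch of what that looks like (assuming HADOOP_CONF_DIR or YARN_CONF_DIR already points at your cluster's client-side configs; on Spark 2.x+ the same thing is spelled .setMaster("yarn") with deploy mode client):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")      # driver runs in this process, executors on YARN
            .setAppName("DataFrameTest"))

    sc = SparkContext(conf=conf)

    # This process (e.g. your Django worker) is now the Spark driver;
    # YARN only schedules the executors that do the distributed work.
    rdd = sc.parallelize(range(100))
    print(rdd.sum())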

Fundamentally, if you're sharing any in-memory state between your web app and your Spark code, that means you won't be able to chop off the Spark portion to run inside a YARN container, which is what yarn-cluster tries to do. If you're not sharing state, then you can simply invoke a subprocess which actually does call spark-submit to bundle an independent PySpark job to run in yarn-cluster mode.

To summarize:

  1. If you want to embed your Spark code directly in your web app, you need to use yarn-client mode instead: SparkConf().setMaster("yarn-client")
  2. If the Spark code is loosely coupled enough that yarn-cluster is actually viable, you can issue a Python subprocess to actually invoke spark-submit in yarn-cluster mode (see the sketch after this list).
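
A hypothetical sketch of option 2, shelling out to spark-submit from the web app (the job script path and arguments are placeholders; assumes Python 3 and that spark-submit is on the PATH):

    import subprocess

    # The job script must be a standalone PySpark program that builds its own
    # SparkContext, since it runs as a separate driver inside a YARN container.
    result = subprocess.run(
        ["spark-submit",
         "--master", "yarn-cluster",   # "--master yarn --deploy-mode cluster" on Spark 2.x+
         "/path/to/job.py"],
        capture_output=True,
        text=True,
    )

    if result.returncode != 0:
        raise RuntimeError("spark-submit failed:\n" + result.stderr)
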
answered Oct 22 '22 by Dennis Huo