Pyspark on yarn-cluster mode

Is there any way to run PySpark scripts in yarn-cluster mode without using the spark-submit script? I need it this way because I will integrate this code into a Django web app.

When I try to run any script in yarn-cluster mode, I get the following error:

org.apache.spark.SparkException: Detected yarn-cluster mode, but isn't running on a cluster. Deployment to YARN is not supported directly by SparkContext. Please use spark-submit.

I'm creating the SparkContext in the following way:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-cluster")
            .setAppName("DataFrameTest"))

    sc = SparkContext(conf=conf)

    # DataFrame code ...

Thanks

asked Jul 09 '15 by jegordon



1 Answer

The reason yarn-cluster mode isn't supported here is that yarn-cluster means bootstrapping the driver program itself (i.e. the program that creates and uses the SparkContext) onto a YARN container. Guessing from your statement about submitting from a Django web app, it sounds like you want the Python code that contains the SparkContext to be embedded in the web app itself, rather than shipping the driver code onto a YARN container that then handles a separate Spark job.

This means your case most closely fits with yarn-client mode instead of yarn-cluster; in yarn-client mode, you can run your SparkContext code anywhere (like inside your web app), while it talks to YARN for the actual mechanics of running jobs.
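As a minimal sketch of what that looks like (assuming HADOOP_CONF_DIR or YARN_CONF_DIR already points at your cluster's client-side configs; on Spark 2.x+ the same thing is spelled .setMaster("yarn") with deploy mode client):

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .setMaster("yarn-client")      # driver runs in this process, executors on YARN
            .setAppName("DataFrameTest"))

    sc = SparkContext(conf=conf)

    # This process (e.g. your Django worker) is now the Spark driver;
    # YARN only schedules the executors that do the distributed work.
    rdd = sc.parallelize(range(100))
    print(rdd.sum())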

Fundamentally, if you're sharing any in-memory state between your web app and your Spark code, that means you won't be able to chop off the Spark portion to run inside a YARN container, which is what yarn-cluster tries to do. If you're not sharing state, then you can simply invoke a subprocess which actually does call spark-submit to bundle an independent PySpark job to run in yarn-cluster mode.

To summarize:

  1. If you want to embed your Spark code directly in your web app, you need to use yarn-client mode instead: SparkConf().setMaster("yarn-client")
  2. If the Spark code is loosely coupled enough that yarn-cluster is actually viable, you can issue a Python subprocess to actually invoke spark-submit in yarn-cluster mode (see the sketch after this list).
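
A hypothetical sketch of option 2, shelling out to spark-submit from the web app (the job script path and arguments are placeholders; assumes Python 3 and that spark-submit is on the PATH):

    import subprocess

    # The job script must be a standalone PySpark program that builds its own
    # SparkContext, since it runs as a separate driver inside a YARN container.
    result = subprocess.run(
        ["spark-submit",
         "--master", "yarn-cluster",   # "--master yarn --deploy-mode cluster" on Spark 2.x+
         "/path/to/job.py"],
        capture_output=True,
        text=True,
    )

    if result.returncode != 0:
        raise RuntimeError("spark-submit failed:\n" + result.stderr)
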
answered Oct 22 '22 by Dennis Huo