I have an AWS EMR cluster running, and in a PySpark3 Jupyter notebook I run this code:
"..
textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)
textRdd.collect().show()
.."
I got this error:
An error was encountered:
Invalid status code '400' from http://..../sessions/4/statements/7 with error payload: {"msg":"requirement failed: Session isn't active."}
Running the line:
sparkDF.show()
works!
I also created a small subset of the file and all my code runs fine.
What is the problem?
With Amazon EMR version 5.21.0 and later, you can override cluster configurations and specify additional configuration classifications for each instance group in a running cluster. You do this by using the Amazon EMR console, the AWS Command Line Interface (AWS CLI), or the AWS SDK.
Configuration classifications for Spark on Amazon EMR include the spark classification, which sets the maximizeResourceAllocation property to true or false. When true, Amazon EMR automatically configures spark-defaults properties based on the cluster hardware configuration. For more information, see Using maximizeResourceAllocation.
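As a rough illustration of the SDK route described above, a hedged boto3 sketch of reconfiguring the spark classification on an instance group of a running cluster (EMR 5.21.0 or later) might look like this; the region, cluster ID, and instance group ID are placeholders:

import boto3

# Sketch: override the "spark" configuration classification on one instance
# group of a running EMR cluster. IDs below are placeholders.
emr = boto3.client("emr", region_name="us-east-1")

emr.modify_instance_groups(
    ClusterId="j-XXXXXXXXXXXXX",                # placeholder cluster ID
    InstanceGroups=[
        {
            "InstanceGroupId": "ig-XXXXXXXXXXXXX",  # placeholder instance group ID
            "Configurations": [
                {
                    "Classification": "spark",
                    "Properties": {"maximizeResourceAllocation": "true"},
                }
            ],
        }
    ],
)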
I had the same issue; the reason for the timeout is the driver running out of memory. Since you run collect(), all the data gets sent to the driver. By default the driver memory is 1000M when creating a Spark application through JupyterHub, even if you set a higher value through config.json. You can see this by executing the following code from within a Jupyter notebook:
spark.sparkContext.getConf().get('spark.driver.memory')
1000M
To increase the driver memory, run:
%%configure -f
{"driverMemory": "6000M"}
This will restart the application with increased driver memory. You might need to use higher values for your data. Hope it helps.
From this Stack Overflow question's answer, which worked for me.
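As a quick sanity check after the restart (a minimal sketch, not part of the quoted answer; sparkDF and textColName are the names from the question), you can confirm the new setting and pull back only a sample rather than the whole RDD. Note also that collect() returns a plain Python list, which has no .show() method, so the question's collect().show() chain would fail even with enough driver memory:

# Confirm the new driver memory after the %%configure restart.
print(spark.sparkContext.getConf().get('spark.driver.memory'))  # expect '6000M'

textRdd = sparkDF.select(textColName).rdd.flatMap(lambda x: x)

# take(n) ships only n records to the driver, unlike collect(),
# which ships the entire dataset.
for line in textRdd.take(20):
    print(line)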
Judging by the output, if your application is not finishing with a FAILED status, that sounds like a Livy timeout error: your application is likely taking longer than the defined timeout for a Livy session (which defaults to 1h), so even though the Spark app succeeds, your notebook will receive this error if the app takes longer than the Livy session's timeout.
If that's the case, here's how to address it:
1. Edit the /etc/livy/conf/livy.conf file (on the cluster's master node).
2. Set livy.server.session.timeout to a higher value, like 8h (or larger, depending on your app).
3. Restart Livy to pick up the setting: sudo restart livy-server on the cluster's master node.
4. Test your code again.
Alternative way to edit this setting - https://allinonescript.com/questions/54220381/how-to-set-livy-server-session-timeout-on-emr-cluster-boostrap
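If you prefer the alternative route linked above (setting the timeout when the cluster is created rather than editing livy.conf by hand), a hedged boto3 sketch follows. The livy-conf classification and the livy.server.session.timeout property are the documented names; everything else (cluster name, release, instance types, roles) is a placeholder to adjust for your account:

import boto3

# Sketch: create an EMR cluster with a longer Livy session timeout by passing
# the livy-conf classification at cluster creation time.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="notebook-cluster",                      # placeholder name
    ReleaseLabel="emr-5.30.0",                    # any Livy-enabled release
    Applications=[{"Name": "Spark"}, {"Name": "Livy"}],
    Configurations=[
        {
            "Classification": "livy-conf",
            "Properties": {"livy.server.session.timeout": "8h"},
        }
    ],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    JobFlowRole="EMR_EC2_DefaultRole",            # placeholder roles
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])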