I'm using a UDF written in Python to change the base of a number.
I read a Parquet file, apply the UDF, and write the result back to Parquet. Here is the line I run:
input_df.withColumn("origin_base", convert_2_dest_base(input_df.origin_base)).write.mode('overwrite').parquet(destination_path)
This conversion makes Spark use a lot of memory, and I get warnings like this:
17/06/18 08:05:39 WARN TaskSetManager: Lost task 40.0 in stage 4.0 (TID 183, ip-10-100-5-196.ec2.internal, executor 19): ExecutorLostFailure (executor 19 exited caused by one of the running tasks) Reason: Container killed by YARN for exceeding memory limits. 4.4 GB of 4.4 GB physical memory used. Consider boosting spark.yarn.executor.memoryOverhead.
and in the end it fails.
Is a UDF not the right approach here? Why is it consuming so much memory?
It is well known that the use of UDFs (User Defined Functions) in Apache Spark, especially with the Python API, can compromise application performance. For this reason, at Damavis we try to avoid them as much as possible in favour of native functions or SQL.
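For the base conversion in the question specifically, Spark ships a built-in function, conv, that can often replace the UDF entirely. Here is a minimal sketch; the example bases (10 to 16) are assumptions, since the question does not say which bases are involved:

from pyspark.sql import functions as F

# conv(column, fromBase, toBase) runs entirely in the JVM,
# so no Python worker or extra serialization is involved.
result_df = input_df.withColumn("origin_base", F.conv(F.col("origin_base"), 10, 16))
result_df.write.mode("overwrite").parquet(destination_path)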
The reason a Python UDF is slow is probably that PySpark UDFs are not implemented in the most optimized way. As the linked paragraph notes, Spark added a Python API in version 0.7, with support for user-defined functions.
User Defined Functions are an important feature of Spark SQL that help extend the language with custom constructs. UDFs are very useful for extending Spark's vocabulary, but they come with significant performance overhead.
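For context, a plain row-at-a-time Python UDF equivalent to the one in the question might look like this. It is only a sketch: the conversion to hexadecimal and the string return type are assumptions, since the original convert_2_dest_base is not shown.

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def to_dest_base(value):
    # Hypothetical conversion: decimal string -> hexadecimal string
    return format(int(value), "x")

# Every single value is pickled, sent to a Python worker, converted,
# and sent back to the JVM.
convert_2_dest_base = udf(to_dest_base, StringType())

result_df = input_df.withColumn("origin_base", convert_2_dest_base("origin_base"))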
A pandas user-defined function (UDF)—also known as vectorized UDF—is a user-defined function that uses Apache Arrow to transfer data and pandas to work with the data. pandas UDFs allow vectorized operations that can increase performance up to 100x compared to row-at-a-time Python UDFs.
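As a sketch, the same conversion rewritten as a pandas UDF could look like the following. This uses the Spark 3.x type-hint syntax (on Spark 2.3/2.4 the decorator also takes PandasUDFType.SCALAR), and the hexadecimal conversion is again only an assumption:

import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def convert_2_dest_base_vec(values: pd.Series) -> pd.Series:
    # Whole batches of rows are exchanged with the Python worker via Apache Arrow
    # instead of pickling one value at a time.
    return values.map(lambda v: format(int(v), "x"))

result_df = input_df.withColumn("origin_base", convert_2_dest_base_vec("origin_base"))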
In PySpark, data is processed in Python and cached/shuffled in the JVM. If you only use the built-in DataFrame API, there is not much performance difference compared to Scala. See python vs scala performance.
When you use a UDF, your locally defined function is not registered in the native JVM structures, so it cannot be executed as a simple Java API call; instead, the data has to be serialized and sent to a Python worker, processed there, and then serialized and sent back to the JVM.
The Python worker then has to process the serialized data in off-heap memory; this consumes a lot of off-heap memory and often leads to memoryOverhead errors.
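If you keep the Python UDF, the usual first mitigation for the YARN error above is to raise the off-heap allowance and bound the Python workers. In YARN mode these properties normally need to be set at submit time rather than in code; the values below are only illustrative:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Property name as in the log; on Spark 2.3+ it is also available as spark.executor.memoryOverhead.
    .config("spark.yarn.executor.memoryOverhead", "1024")
    .config("spark.executor.memory", "4g")
    # Limit how much memory each Python worker uses before spilling to disk.
    .config("spark.python.worker.memory", "512m")
    .getOrCreate()
)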
Performance-wise, serialization is slow, and it is often the key to performance tuning.
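One knob that targets serialization directly is Arrow-based columnar transfer between the JVM and Python; the property name depends on the Spark version, as noted in the comment:

# Use Apache Arrow for DataFrame <-> pandas conversions
# (pandas UDFs already exchange batches via Arrow).
# On Spark 2.3/2.4 the property is spark.sql.execution.arrow.enabled.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")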