 

Using a module with udf defined inside freezes pyspark job - explanation?

Here is the situation:

We have a module where we define some functions that return pyspark.sql.DataFrame (DF). To build those DFs we use some pyspark.sql.functions.udf defined either in the same file or in helper modules. When we actually write a job for pyspark to execute, we only import functions from the modules (we provide a .zip file via --py-files) and then just save the dataframe to HDFS.
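For illustration, a minimal sketch of the kind of module layout I mean (all names here are hypothetical):

# helpers.py -- shipped to the cluster via --py-files
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# udf() runs at import time, before the job has an active SparkContext
normalize = udf(lambda s: s.strip().lower(), StringType())

def build_frame(df):
    return df.withColumn("name", normalize(df["name"]))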

The issue is that when we do this, the udf freezes our job. The nasty fix we found was to define the udf functions inside the job itself and pass them into the functions imported from our module. The other fix I found here is to define a class:

from pyspark.sql.functions import udf


class Udf(object):
    def __init__(self, func, spark_type):
        self.func, self.spark_type = func, spark_type

    def __call__(self, *args):
        # udf() is deferred until the wrapper is called, i.e. at query-building time
        return udf(self.func, self.spark_type)(*args)

Then I use this to define my udfs in the module. This works!
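For example, the wrapper is then used in the module like this (names are again hypothetical):

from pyspark.sql.types import StringType

# created at import time, but udf() itself is deferred until the first call
normalize = Udf(lambda s: s.strip().lower(), StringType())

def build_frame(df):
    return df.withColumn("name", normalize(df["name"]))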

Can anybody explain why we have this problem in the first place? And why this fix (the last one with the class definition) works?

Additional info: PySpark 2.1.0, deploying the job on YARN in cluster mode.

Thanks!

asked Jul 14 '17 by Vaidas Armonas

People also ask

How does a PySpark UDF work?

A PySpark UDF is a User Defined Function used to build reusable column logic in Spark. Once created, a UDF can be reused across multiple DataFrames and in SQL (after registering it). The default return type of udf() is StringType.
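A minimal sketch of the usual workflow (column and function names are made up):

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("hello",), ("spark",)], ["word"])

# explicit return type; without one, udf() defaults to StringType
str_len = udf(lambda s: len(s) if s is not None else None, IntegerType())
df.select(str_len("word").alias("len")).show()

# register it to make it usable from SQL as well
spark.udf.register("str_len", lambda s: len(s) if s is not None else None, IntegerType())
spark.sql("SELECT str_len('hello') AS len").show()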

How do you avoid using a UDF in Spark?

To avoid such a UDF, we can refer to a native function called filter. This function was not available in the pyspark.sql.functions package until version 3.1, so in Spark 2 it has to be expressed differently.
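A sketch of both variants, assuming a column nums holding an array of integers:

from pyspark.sql import functions as F

# Spark >= 3.1: native higher-order function, no Python round trip
df = df.withColumn("evens", F.filter("nums", lambda x: x % 2 == 0))

# Spark 2.4 - 3.0: the same function is only reachable through a SQL expression
df = df.withColumn("evens", F.expr("filter(nums, x -> x % 2 = 0)"))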

Why are PySpark UDFs slow?

Python UDFs are slow mostly because PySpark UDFs are not implemented in the most optimized way: every call has to cross the JVM/Python boundary. Spark added a Python API in version 0.7, with support for user-defined functions.
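A small contrast that shows where the overhead comes from, assuming a string column word:

from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Python UDF: every row is serialized, shipped to a Python worker, and back
upper_udf = udf(lambda s: s.upper() if s is not None else None, StringType())
df.select(upper_udf("word"))

# built-in function: stays inside the JVM and is optimized by Catalyst
df.select(F.upper("word"))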

Can a PySpark UDF return multiple columns?

The short answer is: No. Using a PySpark UDF requires Spark to serialize the Scala objects, run a Python process, deserialize the data in Python, run the function, serialize the results, and deserialize them in Scala. This causes a considerable performance penalty, so I recommend avoiding UDFs in PySpark where possible.
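A common workaround, consistent with that "No": return a single struct column and split it afterwards. A sketch with made-up names:

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([
    StructField("head", StringType()),
    StructField("length", IntegerType()),
])

# still one return column -- a struct -- expanded into columns by the final select
split_word = udf(lambda s: (s[0], len(s)), schema)
df.select(split_word("word").alias("parts")).select("parts.*")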


1 Answer

The accepted answer at the link you posted says, "My work around is to avoid creating the UDF until Spark is running and hence there is an active SparkContext." It looks like your issue is with serializing the UDF: at import time there is no active SparkContext yet.

Make sure the functions wrapped by your UDFs in helper classes are static methods or global functions, and define the udf itself inside the public functions that you import elsewhere:

from pyspark.sql.functions import udf


class Helperclass(object):
    @staticmethod
    def my_udf_todo(...):
        ...

    @staticmethod
    def public_function_that_is_imported_elsewhere(...):
        # udf() is only called here, at run time, when an active SparkContext exists
        todo_udf = udf(Helperclass.my_udf_todo, RETURN_SCHEMA)
        ...
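A job that imports this would then look roughly like the following (module and path names are hypothetical):

# job.py -- submitted with --py-files helpers.zip
from helpers import Helperclass

result_df = Helperclass.public_function_that_is_imported_elsewhere(input_df)
result_df.write.parquet("hdfs:///output/path")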
answered Oct 02 '22 by greenie