PySpark user-defined functions inside of a class

I am trying to create a Spark UDF inside a Python class; that is, one of the methods of the class is the UDF. I am getting this error: "PicklingError: Could not serialize object: TypeError: can't pickle _MovedItems objects"

Environment: Azure Databricks (DBR version 6.1 Beta)
Code execution: in the built-in notebook
Python version: 3.5
Spark version: 2.4.4

I have tried defining the UDF outside of the class, in a separate cell, and it works. I do not want to write the code that way, though; I want to follow OOP principles and keep it structured. I have tried everything I could find on Google, and nothing helped. I could not even find any information about the error itself.

from pyspark.sql.functions import udf, struct
from pyspark.sql.types import LongType

class phases():
  def __init__(self, each_mp_pair_df_as_arg, unique_mp_pair_df_as_arg):
    print("Inside the constructor of Class phases ")

    #I need the below 2 variables to be used in my UDF, so I am trying to put
    #them in a class
    self.each_mp_pair_phases_df = each_mp_pair_df_as_arg
    self.unique_mp_pair_phases_df = unique_mp_pair_df_as_arg

  #This is the UDF. 
  def phases_commence(self,each_row):
    print(each_row)
    return 1

  #This is the function that registers the UDF, 
  def initiate_the_phases_on_the_major_track_segment(self):
    print("Inside the 'initiate_the_phases_on_the_major_track_segment()'")

    #registering the UDF
    self.phases_udf = udf(self.phases_commence,LongType())
    new_df = self.each_mp_pair_phases_df.withColumn("status", self.phases_udf((struct([self.each_mp_pair_phases_df[x] for x in self.each_mp_pair_phases_df.columns]))))
    display(new_df)

#This function lives in a different notebook. It creates an object of the class shown above and calls the method that registers the UDF.
def getting_ready_for_the_phases(each_mp_pair_df_as_arg, unique_mp_pair_df_as_arg):

  phase_obj = phases(each_mp_pair_df_as_arg, unique_mp_pair_df_as_arg)
  phase_obj.initiate_the_phases_on_the_major_track_segment()

The error message is: PicklingError: Could not serialize object: TypeError: can't pickle _MovedItems objects

asked Oct 16 '19 15:10

Chinivar Basu

1 Answer

Your function needs to be static (or otherwise must not reference self) in order to define it as a UDF. I was looking for some documentation to provide a good explanation, but couldn't really find any.

Basically (maybe not 100% accurate; corrections are appreciated): when you define a UDF, the function is pickled and copied to each executor automatically. A bound method such as self.phases_commence carries a reference to its instance, so pickling it means pickling self together with everything stored on it, including the two DataFrames, and those most likely cannot be serialized. A static method has no such reference and is pickled as a plain function. Have a look at this post for workarounds other than static methods.
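The difference between a bound method and a static method can be seen without Spark. The following is only a minimal sketch: it uses the standard library pickle module rather than the serializer PySpark uses internally, and a threading.Lock stands in for the DataFrame attributes as "something that cannot be pickled". The names bound_udf, static_udf and can_pickle are made up for this illustration.

```python
import pickle
import threading


class Phases:
    def __init__(self):
        # Stand-in (assumption) for a DataFrame attribute: a lock cannot be pickled.
        self.resource = threading.Lock()

    def bound_udf(self, age):
        return age + 3

    @staticmethod
    def static_udf(age):
        return age + 3


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except Exception:
        return False


phases = Phases()

# A bound method keeps a reference to the instance it was taken from ...
bound_carries_instance = phases.bound_udf.__self__ is phases
# ... whereas a static method is a plain function with no instance attached.
static_carries_instance = hasattr(Phases.static_udf, "__self__")

# Pickling the bound method has to pickle the instance, lock included.
bound_ok = can_pickle(phases.bound_udf)
static_ok = can_pickle(Phases.static_udf)

print("bound method carries instance:", bound_carries_instance)
print("static method carries instance:", static_carries_instance)
print("bound method picklable:", bound_ok)
print("static method picklable:", static_ok)
```

Run as a plain script, this should report that the bound method cannot be pickled while the static one can, which mirrors what happens when Spark tries to ship self.phases_commence to the executors.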

import pyspark.sql.functions as F
import pyspark.sql.types as T


class Phases():
  def __init__(self, df1):
    print("Inside the constructor of Class phases ")

    self.df1 = df1
    self.phases_udf = F.udf(Phases.phases_commence,T.IntegerType())

  #This is the UDF. 
  @staticmethod
  def phases_commence(age):
    age = age +3
    return age

  #This is the method that applies the UDF (registered in __init__) to the DataFrame.
  def doSomething(self):
    print("Inside the doSomething")
    self.df1 = self.df1.withColumn('AgeP2', self.phases_udf(F.col('Age')))

l = [(1, 10, 'F'),
     (2, 2, 'M'),
     (2, 10, 'F'),
     (2, 3, 'F'),
     (3, 10, 'M')]

columns = ['id',  'Age',  'Gender']

df=spark.createDataFrame(l, columns)

bla = Phases(df)
bla.doSomething()
bla.df1.show()

Output:

Inside the constructor of Class phases 
Inside the doSomething
+---+---+------+-----+ 
| id|Age|Gender|AgeP2| 
+---+---+------+-----+ 
|  1| 10|     F|   13| 
|  2|  2|     M|    5| 
|  2| 10|     F|   13| 
|  2|  3|     F|    6| 
|  3| 10|     M|   13| 
+---+---+------+-----+
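If the UDF needs values that live on the instance, one possible workaround (a sketch, not the only option) is to build the function in a module-level factory and pass in only plain, picklable values, so that the function handed to F.udf never captures self. The names make_add_offset, offset, do_something and AgeShifted are hypothetical. Note that this only works for ordinary Python values such as numbers, strings, lists or dicts; a DataFrame itself cannot be used inside a UDF.

```python
def make_add_offset(offset):
    """Build a plain function that closes over a picklable value only."""
    def add_offset(age):
        # Pass nulls through instead of failing on None + int.
        return None if age is None else age + offset
    return add_offset


class Phases:
    def __init__(self, df, offset=3):
        self.df = df
        self.offset = offset  # plain int, safe to ship to the executors

    def do_something(self):
        # PySpark is imported here so the factory above can be tried without Spark.
        import pyspark.sql.functions as F
        import pyspark.sql.types as T

        # Hand over the value, not self: the closure then captures just an int.
        add_udf = F.udf(make_add_offset(self.offset), T.LongType())
        return self.df.withColumn("AgeShifted", add_udf(F.col("Age")))
```

With the df from the example above, Phases(df, 3).do_something().show() would be the corresponding call; the class keeps its structure while the function sent to the executors stays free of any reference to the instance.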

answered Oct 13 '22 11:10

cronoik

Tags: python-3.x, jupyter-notebook, pyspark, azure-databricks
