I am working in PySpark with a lambda function like the following:
udf_func = UserDefinedFunction(lambda value: method1(value, dict_global), IntegerType())
result_col = udf_func(df[atr1])
The implementation of method1 is as follows:
def method1(value, dict_global):
    result = len(dict_global)
    if value in dict_global:
        result = dict_global[value]
    return result
'dict_global' is a global dictionary that contains some values.
The problem is that when I execute the lambda function the result is always None. For some reason, the 'method1' function does not see the variable 'dict_global' as an external variable. Why? What could I do?
Finally, I found a solution. I write it below:
Lambda functions (as well as map and reduce functions) executed in Spark have their execution scheduled among the different executors, and they run in different execution threads. So the problem in my code could be that global variables are sometimes not captured by the functions executed in parallel in different threads, so I looked for a way to solve it.
Fortunately, Spark provides a feature called "broadcast" which allows variables to be passed to the functions executed across the executors, so they can work with them without problems. There are two types of shareable variables: broadcast variables (immutable, read-only) and accumulators (mutable, but only numeric values are accepted).
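As a quick illustration (the values below are made up for the example), both kinds of shared variables are created from the SparkContext (sc):
# Broadcast: a read-only copy of a value shipped to every executor.
lookup = sc.broadcast({'a': 1, 'b': 2})
print(lookup.value['a'])  # the underlying dict is accessed through .value
# Accumulator: a numeric variable the executors can only add to.
counter = sc.accumulator(0)
sc.parallelize([1, 2, 3]).foreach(lambda x: counter.add(x))
print(counter.value)  # read the accumulated total back on the driver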
I rewrote my code to show you how I fixed the problem:
broadcastVar = sc.broadcast(dict_global)
udf_func = UserDefinedFunction(lambda value: method1(value, broadcastVar.value), IntegerType())
result_col = udf_func(df[atr1])
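Note that it is broadcastVar.value (the underlying dictionary) that is passed to method1, so the function itself stays unchanged. For completeness, a self-contained sketch of the whole fix could look like this (the DataFrame contents, column name and dictionary values are invented for the example, and I use udf from pyspark.sql.functions, which plays the same role as UserDefinedFunction):
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.master('local[*]').appName('broadcast-example').getOrCreate()
sc = spark.sparkContext

dict_global = {'a': 1, 'b': 2}  # example lookup values
df = spark.createDataFrame([('a',), ('b',), ('c',)], ['atr1'])

def method1(value, lookup):
    # Default to the dictionary size when the value is missing.
    result = len(lookup)
    if value in lookup:
        result = lookup[value]
    return result

# Broadcast the dictionary so every executor gets a read-only copy.
broadcastVar = sc.broadcast(dict_global)
udf_func = udf(lambda value: method1(value, broadcastVar.value), IntegerType())

df.withColumn('result', udf_func(df['atr1'])).show()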
Hope it helps!