Global variables not recognized in lambda functions in Pyspark

I am working in Pyspark with a lambda function like the following:

from pyspark.sql.functions import UserDefinedFunction
from pyspark.sql.types import IntegerType
udf_func = UserDefinedFunction(lambda value: method1(value, dict_global), IntegerType())
result_col = udf_func(df[atr1])

The implementation of method1 is as follows:

def method1(value, dict_global):
    # Return the mapped value, or the dictionary size if the key is absent
    result = len(dict_global)
    if value in dict_global:
        result = dict_global[value]
    return result

'dict_global' is a global dictionary that contains some values.

The problem is that when I execute the lambda function the result is always None. For some reason the method1 function doesn't see the variable dict_global as an external variable. Why? What could I do?

asked Dec 13 '22 by jartymcfly
1 Answer

Finally I found a solution, which I describe below:

Lambda functions (as well as map and reduce functions) in Spark are scheduled across the different executors, and they run in separate worker processes. So the problem in my code was that global variables defined on the driver are sometimes not captured by the functions executed in parallel on the executors, so I looked for a way to solve it.

Fortunately, Spark provides a mechanism called broadcast variables, which lets you ship a variable to the executors so the functions running there can use it without problems. There are two types of shared variables: broadcast variables (immutable, read-only) and accumulators (mutable, but only numeric values are accepted by default).
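For completeness, here is a minimal accumulator sketch of my own (not part of the fix, assuming a SparkContext sc is available), just to show the other kind of shared variable:

none_count = sc.accumulator(0)

def tag_missing(value):
    if value is None:
        none_count.add(1)   # tasks can only add; they cannot read the value
    return value

sc.parallelize([1, None, 3, None]).map(tag_missing).collect()
print(none_count.value)     # only the driver can read the accumulated total (2 here)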

I rewrote my code to show you how I fixed the problem:

broadcastVar = sc.broadcast(dict_global)
# broadcastVar.value yields the broadcast dictionary, so method1 receives a plain dict
udf_func = UserDefinedFunction(lambda value: method1(value, broadcastVar.value), IntegerType())
result_col = udf_func(df[atr1])
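Another common pattern (a hypothetical sketch of my own, with method1_bc as an illustrative name) is to pass the broadcast handle itself and unwrap it with .value inside the function, which keeps the lookup logic next to the broadcast access:

def method1_bc(value, broadcast_dict):
    d = broadcast_dict.value      # the broadcast dictionary, fetched on the executor
    return d.get(value, len(d))   # same logic as method1: mapped value if present, size otherwise

udf_func = UserDefinedFunction(lambda value: method1_bc(value, broadcastVar), IntegerType())
result_col = udf_func(df[atr1])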

Hope it helps!

answered Jan 03 '23 by jartymcfly