 

lambda or not in PySpark UDF

Tags: lambda, pyspark

What is the benefit of using a lambda function in a PySpark UDF? Here is an example:

def square(x):
    return float(x**2)

With lambda, I tried this:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
f_square = udf(lambda x: square(x), FloatType())
result_w_square = result.withColumn('square', f_square(result.x))

Without lambda, I tried this:

f_square = udf(square, FloatType())
result_w_square2 = result.withColumn('square', f_square(result.x))

I got the same result. Which approach is better?

kee asked May 22 '26 02:05


1 Answer

withColumn and other PySpark API functions are designed to take Python expressions and run those same expressions across remote machines.

However, a Python function can only accept objects as arguments, not unevaluated expressions. The only way to pass an expression around as an object is to wrap it in a function, which works because functions are first-class objects in Python.

If you don't reuse an expression, though, defining a named function every time can be tedious. A lambda lets you write an anonymous function inline, without a separate definition, which is often more concise.

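So choose based on whether you reuse the expression: a named function if you do, a lambda if you don't. In your example square is already defined, so wrapping it in lambda x: square(x) adds nothing and you can pass square to udf directly.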
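To make the point concrete, here is a minimal sketch in plain Python, so it runs without a Spark session. The helper apply_to_values is hypothetical and only stands in for something like udf that receives a function object. It is not part of PySpark.

```python
def square(x):
    return float(x ** 2)


def apply_to_values(func, values):
    """Hypothetical stand-in for a UDF: call func on each value."""
    return [func(v) for v in values]


values = [1, 2, 3]

# Pass the named function object directly.
by_name = apply_to_values(square, values)

# Wrap the named function in a lambda: an extra call layer, same result.
by_wrapper = apply_to_values(lambda x: square(x), values)

# Write the expression inline: handy when it is used only once.
inline = apply_to_values(lambda x: float(x ** 2), values)

print(by_name)                          # [1.0, 4.0, 9.0]
print(by_name == by_wrapper == inline)  # True
```

All three calls hand over a callable. The only differences are whether the function has a name and whether the lambda adds a redundant layer around it.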

Hyunsik Choi answered May 24 '26 15:05


