What is the benefit of using a lambda function in PySpark? Here is an example:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
def square(x):
    return float(x**2)
With lambda, I tried this:
f_square = udf(lambda x: square(x), FloatType())
result_w_square = result.withColumn('square', f_square(result.x))
Without lambda, I tried this:
f_square = udf(square, FloatType())
result_w_square2 = result.withColumn('square', f_square(result.x))
I got the same result. Which approach is better?
withColumn and other PySpark API functions are designed to take Python expressions and run those same expressions across remote machines.
However, a Python function can only receive objects as arguments, not unevaluated expressions. The only way to pass an expression around as an object is to wrap it in a function, which works because functions are first-class objects in Python.
However, if you don't reuse an expression, defining a named function for it every time can be tedious. A lambda lets you write an anonymous function inline, without a separate def, which is often more concise.
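To make the comparison concrete without needing a Spark cluster, here is a rough plain-Python sketch. Note that apply_to_column is a hypothetical helper I made up as a stand-in for udf plus withColumn; it is not part of the PySpark API. It only shows that a named function, a lambda that forwards to it, and a lambda holding the expression directly are all just function objects handed to the caller.

```python
# Plain-Python stand-in for udf + withColumn. apply_to_column is a
# hypothetical helper for illustration, not a PySpark function.
def square(x):
    return float(x ** 2)


def apply_to_column(values, func):
    # Like a UDF, func arrives as an object and is called once per value.
    return [func(v) for v in values]


xs = [1, 2, 3]

# Named function passed directly.
named = apply_to_column(xs, square)

# Lambda that only forwards to square: same result, one extra call per value.
wrapped = apply_to_column(xs, lambda x: square(x))

# Lambda holding a one-off expression, with no separate definition.
inline = apply_to_column(xs, lambda x: float(x ** 2))

print(named, wrapped, inline)
```

All three calls return the same list. The only real difference is where the expression lives: in a reusable def, or inline at the call site.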
So choose based on whether you reuse the expression. In your example, lambda x: square(x) only forwards its argument to square, so passing square directly is equivalent and simpler.