lambda or not in PySpark UDF

Question

What is the benefit of using a lambda function in PySpark? Here is an example:

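from pyspark.sql.functions import udf  # needed by the udf(...) calls
from pyspark.sql.types import FloatType

def square(x):
    return float(x**2)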

With lambda, I tried this:

f_square = udf(lambda x: square(x), FloatType())
result_w_square = result.withColumn('square', f_square(result.x))

Without lambda, I tried this:

f_square = udf(square, FloatType())
result_w_square2 = result.withColumn('square', f_square(result.x))

I got the same result. Which approach is better?

Hyunsik Choi · Accepted Answer

udf, withColumn, and other PySpark API functions are designed to take a piece of Python logic and run that same logic across remote machines.

However, a Python function can only receive objects as arguments, not unevaluated expressions. To pass an expression around as an object, you have to wrap it in a function. This works because functions are first-class objects in Python.
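The difference between passing an expression and passing a function can be sketched in plain Python, without Spark. This is only an illustration: apply_to_each is a hypothetical stand-in for the per-row calls Spark makes with a UDF, not a PySpark API.

```python
def square(x):
    return float(x ** 2)


def apply_to_each(func, values):
    # Hypothetical stand-in for Spark calling a UDF once per row value.
    return [func(v) for v in values]


values = [1, 2, 3]

# An expression is evaluated immediately, so only its result could be passed on.
already_evaluated = float(values[0] ** 2)

# A function is an ordinary object: it can be bound to another name and
# handed to other functions, which call it later.
same_function = square
named_result = apply_to_each(square, values)

# A lambda wraps an expression in an anonymous function object,
# so the expression itself can be passed around and evaluated later.
lambda_result = apply_to_each(lambda x: float(x ** 2), values)

print(named_result, lambda_result)
```

Both calls hand apply_to_each a function object; the only difference is whether that object has a name.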

However, if you don't reuse an expression, defining a named function every time is tedious. A lambda lets you write an anonymous function inline, without a separate function definition, which is often more concise.

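So you can choose either approach, depending on whether you reuse the expression. In your example square already exists as a named function, so lambda x: square(x) only wraps it in one more call, and passing square directly gives the same result.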
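For the two variants in the question, a small plain-Python check (no Spark needed) suggests why the results match. The names wrapped and direct are introduced here for illustration; they stand for the objects that would be handed to udf in each version.

```python
def square(x):
    return float(x ** 2)


# What the "with lambda" version hands to udf: a thin wrapper around square.
wrapped = lambda x: square(x)

# What the "without lambda" version hands to udf: square itself.
direct = square

inputs = [0, 1.5, -2, 3]
wrapped_results = [wrapped(v) for v in inputs]
direct_results = [direct(v) for v in inputs]

# Both callables compute the same values for the same inputs.
results_match = wrapped_results == direct_results

# The named function keeps its name, which is what shows up in tracebacks;
# the lambda is reported as "<lambda>".
print(direct.__name__, wrapped.__name__, results_match)
```

When the function already exists, passing it directly avoids the extra call layer and keeps its name visible in error messages. A lambda fits better when the expression is a one-off that is not worth naming.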

Tags: lambda, pyspark

kee
