Let's say this is my dataframe ...
name | scores
Dan  | [10, 5, 2, 12]
Ann  | [12, 3, 5]
Jon  | []
Desired output is something like
name | scores         | Total
Dan  | [10, 5, 2, 12] | 29
Ann  | [12, 3, 5]     | 20
Jon  | []             | 0
I made a UDF along the lines of ...

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

sum_cols = udf(lambda arr: 0 if not arr else sum(arr), IntegerType())
df.withColumn('Total', sum_cols(col('scores'))).show()
However, I have learned that UDFs are relatively slow compared to native PySpark functions.
Is there any way to do the above in PySpark without a UDF?
For Spark 3.1+, you can simply call pyspark.sql.functions.aggregate:
import pyspark.sql.functions as F

df = df.withColumn(
    "Total",
    F.aggregate("scores", F.lit(0), lambda acc, x: acc + x)
)
Note that you should use F.lit(0.0) as the initial value if the array elements are not integers (otherwise the accumulator type won't match).
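For reference, here is a minimal end-to-end sketch of this approach using the sample data from the question (the dataframe is built inline with an assumed array<int> schema; spark is an existing SparkSession):

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("Dan", [10, 5, 2, 12]), ("Ann", [12, 3, 5]), ("Jon", [])],
    "name string, scores array<int>",
)

# aggregate() folds each row's array into a single value;
# an empty array simply returns the initial value, 0.
df.withColumn(
    "Total",
    F.aggregate("scores", F.lit(0), lambda acc, x: acc + x)
).show()

# Output looks roughly like:
# +----+--------------+-----+
# |name|        scores|Total|
# +----+--------------+-----+
# | Dan|[10, 5, 2, 12]|   29|
# | Ann|    [12, 3, 5]|   20|
# | Jon|            []|    0|
# +----+--------------+-----+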