pyspark -- best way to sum values in column of type Array(Integer())

Let's say this is my dataframe:

name | scores
Dan  | [10, 5, 2, 12]
Ann  | [12, 3, 5]
Jon  | []

Desired output is something like

name | scores          | Total
Dan  | [10, 5, 2, 12]  | 29
Ann  | [12, 3, 5]      | 20
Jon  | []              | 0

I made a UDF along these lines:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

sum_cols = udf(lambda arr: 0 if arr == [] else __builtins__.sum(arr), IntegerType())

df.withColumn('Total', sum_cols(col('scores'))).show()

However, I have learned that UDFs are relatively slow compared to built-in PySpark functions.

Is there a way to do the above in PySpark without a UDF?

asked Dec 15 '17 by js_55


1 Answer

For Spark 3.1+, you can simply call pyspark.sql.functions.aggregate:

import pyspark.sql.functions as F

df = df.withColumn(
    "Total",
    # fold the array into one value: start from 0 and add each element
    F.aggregate("scores", F.lit(0), lambda acc, x: acc + x)
)

Note that you should use F.lit(0.0) as the initial value if the array elements are not integers (e.g. doubles), since the initial value's type must match the element type.
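
If you are on Spark 2.4–3.0, which predate the Python-side aggregate function but do support SQL higher-order functions, a roughly equivalent sketch is to pass the same expression through F.expr (the column name scores is taken from the question):

import pyspark.sql.functions as F

# SQL higher-order function aggregate(), available since Spark 2.4;
# an empty array just returns the initial value, so Jon gets 0
df = df.withColumn("Total", F.expr("aggregate(scores, 0, (acc, x) -> acc + x)"))

Either way the computation stays in the JVM, avoiding the Python serialization overhead that makes a UDF slow.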

answered Sep 29 '22 by johnnyasd12