pyspark -- best way to sum values in column of type Array(Integer())

Let's say this is my dataframe:

name | scores
Dan  | [10, 5, 2, 12]
Ann  | [12, 3, 5]
Jon  | []

Desired output is something like

name | scores          | Total
Dan  | [10, 5, 2, 12]  | 29
Ann  | [12, 3, 5]      | 20
Jon  | []              | 0

I made a UDF along these lines:

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

sum_cols = udf(lambda arr: 0 if arr == [] else __builtins__.sum(arr), IntegerType())

df.withColumn('Total', sum_cols(col('scores'))).show()

However, I have learned that UDFs are relatively slow compared to built-in PySpark functions.

Is there a way to do the above in PySpark without a UDF?

asked Dec 15 '17 by js_55


1 Answer

For Spark 3.1+, you can simply call pyspark.sql.functions.aggregate:

import pyspark.sql.functions as F

df = df.withColumn(
    "Total",
    # fold the array into one value: start from 0 and add each element
    F.aggregate("scores", F.lit(0), lambda acc, x: acc + x)
)

Note that you should use F.lit(0.0) as the initial value if the array elements are not integers (e.g. doubles), since the initial value's type must match the element type.
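
If you are on Spark 2.4–3.0, which predate the Python-side aggregate function but do support SQL higher-order functions, a roughly equivalent sketch is to pass the same expression through F.expr (the column name scores is taken from the question):

import pyspark.sql.functions as F

# SQL higher-order function aggregate(), available since Spark 2.4;
# an empty array just returns the initial value, so Jon gets 0
df = df.withColumn("Total", F.expr("aggregate(scores, 0, (acc, x) -> acc + x)"))

Either way the computation stays in the JVM, avoiding the Python serialization overhead that makes a UDF slow.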

answered Sep 29 '22 by johnnyasd12