
PySpark - Sum a column in dataframe and return results as int

I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result return as an int in a python variable.

df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"]) 

I do the following to sum the column.

df.groupBy().sum() 

But I get a dataframe back.

+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+

I would like 130 returned as an int, stored in a variable to be used elsewhere in the program:

result = 130 
Bryce Ramgovind asked Dec 14 '17

2 Answers

I think the simplest way:

df.groupBy().sum().collect() 

will return a list. In your example:

In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
Olivier Darrouzet answered Sep 29 '22


The simplest way is indeed:

df.groupBy().sum().collect() 

But it can be a very slow operation: to avoid groupByKey, you can drop to the RDD API and use reduceByKey:

# Map every row to the same key (1) paired with the second column's value,
# so reduceByKey folds all values together; then read the sum out of the
# single (key, total) pair that collect() returns.
df.rdd.map(lambda x: (1, x[1])).reduceByKey(lambda x, y: x + y).collect()[0][1]

I tried this on a bigger dataset and measured the processing time:

- RDD and reduceByKey: 2.23 s
- groupByKey: 30.5 s
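To see what the map/reduceByKey pipeline above is doing, here is the same reduction in plain Python (no Spark needed; the row tuples and the dummy key 1 mirror the lambdas above):

```python
from functools import reduce

# The same rows the question builds the DataFrame from.
rows = [("A", 20), ("B", 30), ("D", 80)]

# map step: every row becomes (1, Number), so all values share one key.
pairs = [(1, row[1]) for row in rows]

# reduceByKey step: fold pairs with the same key using x + y on the values,
# then take the value out of the resulting (key, total) pair.
total = reduce(lambda x, y: (1, x[1] + y[1]), pairs)[1]
print(total)  # 130
```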

Aron Asztalos answered Sep 29 '22