I have a pyspark dataframe with a column of numbers. I need to sum that column and then have the result return as an int in a python variable.
df = spark.createDataFrame([("A", 20), ("B", 30), ("D", 80)],["Letter", "Number"])
I do the following to sum the column.
df.groupBy().sum()
But I get a dataframe back.
+-----------+
|sum(Number)|
+-----------+
|        130|
+-----------+
I would like 130 returned as an int, stored in a Python variable to be used elsewhere in the program.
result = 130
I think the simplest way:
df.groupBy().sum().collect()
will return a list. In your example:
In [9]: df.groupBy().sum().collect()[0][0]
Out[9]: 130
Really, the simplest way is:

df.groupBy().sum().collect()

But it is a very slow operation. To avoid the grouped aggregation, you can drop down to the RDD API and use reduceByKey:
df.rdd.map(lambda x: (1,x[1])).reduceByKey(lambda x,y: x + y).collect()[0][1]
I tried this on a bigger dataset and measured the processing time:

RDD and reduceByKey: 2.23 s

groupBy: 30.5 s