say I have a dataframe like this
name age city
abc 20 A
def 30 B
i want to add a summary row at the end of the dataframe, so result will be like
name age city
abc 20 A
def 30 B
All 50 All
So String 'All', I can easily put, but how to get sum(df['age']) ###column object is not iterable
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
data.printSchema()
#root
#|-- name: string (nullable = true)
#|-- age: long (nullable = true)
#|-- city: string (nullable = true)
res = data.union(spark.createDataFrame([('All',sum(data['age']),'All')], data.columns)) ## TypeError: Column is not iterable
#Even tried with data['age'].sum() and got error. If i am using [('All',50,'All')], it is doing fine.
I usually work on Pandas dataframe and new to Spark. Might be my undestanding about spark dataframe is not that matured.
Please suggest, how to get the sum over a dataframe-column in pyspark. And if there is any better way to add/append a row to end of a dataframe. Thanks.
Spark SQL has a dedicated module for column functions pyspark.sql.functions
.
So the way it works is:
from pyspark.sql import functions as F
data = spark.createDataFrame([("abc", 20, "A"), ("def", 30, "B")],["name", "age", "city"])
res = data.unionAll(
data.select([
F.lit('All').alias('name'), # create a cloumn named 'name' and filled with 'All'
F.sum(data.age).alias('age'), # get the sum of 'age'
F.lit('All').alias('city') # create a column named 'city' and filled with 'All'
]))
res.show()
Prints:
+----+---+----+
|name|age|city|
+----+---+----+
| abc| 20| A|
| def| 30| B|
| All| 50| All|
+----+---+----+
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With