I have the following DataFrame in PySpark (this is the result of a take(3); the DataFrame is very big):
sc = SparkContext()
df = [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
The same owner can appear in multiple rows. What I need to do is sum the values of the field a_d per owner, after grouping:
b = df.groupBy('owner').agg(sum('a_d').alias('a_d_sum'))
but this throws an error:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
However, the schema contains double values, not strings (this is the output of printSchema()):
root
 |-- owner: string (nullable = true)
 |-- a_d: double (nullable = true)
So what is happening here?
You are not using the correct sum function, but Python's built-in function sum (which is in scope by default). The reason the built-in function won't work is that it takes an iterable as an argument, whereas here what's passed is the name of a column as a string, and the built-in function can't be applied to a string. See the Python official documentation.
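As a quick illustration (plain Python, no Spark involved): the built-in sum starts from 0 and iterates over its argument, so passing a column name makes it try to add 0 to the first character, which produces exactly the error above:

# Built-in sum iterates over the string 'a_d' and attempts 0 + 'a',
# which raises the same TypeError reported in the question.
sum('a_d')
# TypeError: unsupported operand type(s) for +: 'int' and 'str'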
You'll need to import the proper function from pyspark.sql.functions:
from pyspark.sql import Row
from pyspark.sql.functions import sum as _sum

df = sqlContext.createDataFrame(
    [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
)
df2 = df.groupBy('owner').agg(_sum('a_d').alias('a_d_sum'))
df2.show()

# +-----+-------+
# |owner|a_d_sum|
# +-----+-------+
# |   u1|    0.4|
# |   u2|    0.0|
# +-----+-------+
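If you'd rather not alias the import, a common alternative is to import the functions module under a short name so the Spark sum can never collide with the built-in one (a minimal sketch reusing df from above; F is just a conventional alias, not required):

import pyspark.sql.functions as F

# Same aggregation, referring to the Spark sum explicitly through
# the module alias instead of shadowing the built-in name.
df2 = df.groupBy('owner').agg(F.sum('a_d').alias('a_d_sum'))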