 

Sum operation on PySpark DataFrame giving TypeError when type is fine


I have the following DataFrame in PySpark (this is the result of a take(3); the full DataFrame is very big):

sc = SparkContext()
df = [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]

The same owner will have multiple rows. What I need to do is sum the values of the field a_d per owner, after grouping, as

b = df.groupBy('owner').agg(sum('a_d').alias('a_d_sum')) 

but this throws the error

TypeError: unsupported operand type(s) for +: 'int' and 'str'

However, the schema contains double values, not strings (this is the output of printSchema()):

root
 |-- owner: string (nullable = true)
 |-- a_d: double (nullable = true)

So what is happening here?

asked Apr 19 '16 by mar tin

People also ask

How do you sum rows in PySpark?

Row-wise sum in PySpark is calculated using the sum() function. Row-wise minimum (min) is calculated using the least() function. Row-wise maximum (max) is calculated using the greatest() function.
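As a rough sketch of these row-wise operations (the sample DataFrame and the column names c1, c2, c3 are assumptions, not from the question; the row sum is shown here as plain column addition):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, least, greatest

spark = SparkSession.builder.getOrCreate()
df_num = spark.createDataFrame([(1.0, 2.0, 3.0), (4.0, 0.5, 2.5)], ["c1", "c2", "c3"])

df_num.select(
    (col("c1") + col("c2") + col("c3")).alias("row_sum"),  # row-wise sum across columns
    least("c1", "c2", "c3").alias("row_min"),              # row-wise minimum
    greatest("c1", "c2", "c3").alias("row_max"),           # row-wise maximum
).show()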

What is F explode in PySpark?

explode() is a function in pyspark.sql.functions that is used to expand an array or map column into rows. It returns a new row for each element in the array or map.
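A minimal sketch of explode (the sample data is an assumption):

from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.getOrCreate()
df_arr = spark.createDataFrame([("u1", [1, 2, 3]), ("u2", [4])], ["owner", "values"])

# One output row per element of the array column.
df_arr.select("owner", explode("values").alias("value")).show()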

What does .collect do in PySpark?

collect() is an action on an RDD or DataFrame that is used to retrieve the data. It gathers all the elements of every partition and brings them back to the driver node/program.
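A minimal sketch of collect (the sample DataFrame is an assumption):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df_small = spark.createDataFrame([("u1", 0.1), ("u2", 0.0)], ["owner", "a_d"])

rows = df_small.collect()          # list of Row objects gathered on the driver
print(rows[0].owner, rows[0].a_d)  # fields are accessible by name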


1 Answer

You are not using the correct sum function; by default, the name sum refers to Python's built-in sum, not the Spark SQL aggregate.

The reason the built-in function won't work is that it takes an iterable as its argument, whereas here the column name passed is a string, and the built-in sum can't be applied to a string. Ref. Python official documentation.
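To see why, plain Python reproduces the exact same message when the built-in sum is handed the column name (a quick sketch, not Spark code):

# The built-in sum iterates over its argument; a string iterates over its
# characters, and 0 + 'a' raises exactly the reported TypeError.
sum('a_d')
# TypeError: unsupported operand type(s) for +: 'int' and 'str'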

You'll need to import the proper function from pyspark.sql.functions:

from pyspark.sql import Row
from pyspark.sql.functions import sum as _sum

df = sqlContext.createDataFrame(
    [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
)

df2 = df.groupBy('owner').agg(_sum('a_d').alias('a_d_sum'))
df2.show()

# +-----+-------+
# |owner|a_d_sum|
# +-----+-------+
# |   u1|    0.4|
# |   u2|    0.0|
# +-----+-------+
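An equally common pattern, which avoids shadowing the built-in name altogether, is to import the functions module under an alias; this is a minor variation, not part of the original answer:

import pyspark.sql.functions as F

df2 = df.groupBy('owner').agg(F.sum('a_d').alias('a_d_sum'))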
answered Sep 30 '22 by eliasah