I have the following DataFrame in PySpark (this is the result of a take(3); the DataFrame is very big):
sc = SparkContext()
df = [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
The same owner can appear in multiple rows. What I need to do is sum the values of the field a_d per owner, after grouping:
b = df.groupBy('owner').agg(sum('a_d').alias('a_d_sum'))
but this throws an error:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
However, the schema contains double values, not strings (this is the output of printSchema()):
root
 |-- owner: string (nullable = true)
 |-- a_d: double (nullable = true)
So what is happening here?
You are not using the correct sum function, but Python's built-in function sum (which is in scope by default). The reason the built-in function won't work is that it takes an iterable as an argument, whereas here what's passed is the name of a column as a string, and the built-in function can't be applied to a string. See the Python official documentation.
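As a quick illustration (plain Python, no Spark involved): the built-in sum starts from 0 and iterates over its argument, so passing a column name makes it try to add 0 to the first character, which produces exactly the error above:

# Built-in sum iterates over the string 'a_d' and attempts 0 + 'a',
# which raises the same TypeError reported in the question.
sum('a_d')
# TypeError: unsupported operand type(s) for +: 'int' and 'str'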
You'll need to import the proper function from pyspark.sql.functions:
from pyspark.sql import Row
from pyspark.sql.functions import sum as _sum

df = sqlContext.createDataFrame(
    [Row(owner=u'u1', a_d=0.1), Row(owner=u'u2', a_d=0.0), Row(owner=u'u1', a_d=0.3)]
)
df2 = df.groupBy('owner').agg(_sum('a_d').alias('a_d_sum'))
df2.show()

# +-----+-------+
# |owner|a_d_sum|
# +-----+-------+
# |   u1|    0.4|
# |   u2|    0.0|
# +-----+-------+
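If you'd rather not alias the import, a common alternative is to import the functions module under a short name so the Spark sum can never collide with the built-in one (a minimal sketch reusing df from above; F is just a conventional alias, not required):

import pyspark.sql.functions as F

# Same aggregation, referring to the Spark sum explicitly through
# the module alias instead of shadowing the built-in name.
df2 = df.groupBy('owner').agg(F.sum('a_d').alias('a_d_sum'))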