
Why does pyspark agg tell me that datatypes are incorrect here?

I have a dataframe with the following types:

>>> mydf.printSchema()
root
 |-- protocol: string (nullable = true)
 |-- source_port: long (nullable = true)
 |-- bytes: long (nullable = true)

And when I try to aggregate it like this:

df_agg = mydf.groupBy('protocol').agg(sum('bytes'))

I am being told:

TypeError: unsupported operand type(s) for +: 'int' and 'str'

Now, this does not make sense to me, since I see the types are fine for aggregation in printSchema() as you can see above.

So, I tried converting the column to integer just in case:

from pyspark.sql.types import IntegerType
mydf_converted = mydf.withColumn("converted", mydf["bytes"].cast(IntegerType()))

but still failure:

my_df_agg_converted = mydf_converted.groupBy('protocol').agg(sum('converted'))

TypeError: unsupported operand type(s) for +: 'int' and 'str'

How can I fix this? I looked at this related question, but its fix did not help me at all; I get the same error: Sum operation on PySpark DataFrame giving TypeError when type is fine

makansij asked Oct 14 '25 23:10


1 Answer

Python is resolving `sum` to its built-in `sum` function instead of the PySpark aggregation function you want. So you are effectively passing the string 'converted' to the Python built-in `sum`, which expects an iterable of numbers.
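The clash can be reproduced in plain Python, without Spark at all: the built-in `sum` iterates its argument and adds each item to 0, so a column-name string fails on the very first addition (`0 + 'c'`), producing exactly the error message from the traceback above.

```python
# Built-in sum() expects an iterable of numbers; a string makes it
# attempt 0 + 'c', which raises the same TypeError seen in the question.
try:
    sum('converted')
    raised = None
except TypeError as exc:
    raised = str(exc)

print(raised)  # unsupported operand type(s) for +: 'int' and 'str'
```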

Try loading pyspark functions with an alias instead:

import pyspark.sql.functions as psf
my_df_agg_converted = mydf_converted.groupBy('protocol').agg(psf.sum('converted'))

This tells Python to use the PySpark function rather than the built-in one. (Equivalently, you could `from pyspark.sql.functions import sum`, but that shadows the built-in `sum` for the rest of the module, so the alias import is safer.)

MaFF answered Oct 17 '25 13:10


