I have a DataFrame with the following types:
>>> mydf.printSchema()
root
 |-- protocol: string (nullable = true)
 |-- source_port: long (nullable = true)
 |-- bytes: long (nullable = true)
And when I try to aggregate it like this:
df_agg = mydf.groupBy('protocol').agg(sum('bytes'))
I am being told:
TypeError: unsupported operand type(s) for +: 'int' and 'str'
Now, this does not make sense to me: printSchema() shows the types are fine for aggregation, as you can see above.
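Checking mydf.dtypes confirms the same thing (long columns show up as bigint):
>>> mydf.dtypes
[('protocol', 'string'), ('source_port', 'bigint'), ('bytes', 'bigint')]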
So I tried casting it to integer, just in case:
from pyspark.sql.types import IntegerType
mydf_converted = mydf.withColumn("converted", mydf["bytes"].cast(IntegerType()))
but it still fails:
my_df_agg_converted = mydf_converted.groupBy('protocol').agg(sum('converted'))
TypeError: unsupported operand type(s) for +: 'int' and 'str'
How can I fix this? I looked at this question, but the fix did not help me at all - same issue: Sum operation on PySpark DataFrame giving TypeError when type is fine
Python is confusing its built-in sum function with the PySpark aggregation sum function you want to use. Since PySpark's sum was never imported into your namespace, you are actually passing the string 'converted' to the Python built-in sum, which expects an iterable of numbers.
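You can reproduce the error in a plain Python shell, with no Spark involved: the built-in sum starts from 0 and iterates over the string, so the very first step is 0 + 'c':
>>> sum('converted')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'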
Try importing the pyspark functions module under an alias instead:
import pyspark.sql.functions as psf
my_df_agg_converted = mydf_converted.groupBy('protocol').agg(psf.sum('converted'))
This tells Python to use the PySpark sum rather than the built-in one.
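If you'd rather keep calling it sum, you can also import just that function under a different name. Here is a self-contained sketch (the SparkSession setup and the sample rows are hypothetical, just matching the schema in the question):
from pyspark.sql import SparkSession
import pyspark.sql.functions as psf

spark = SparkSession.builder.getOrCreate()

# Hypothetical rows matching the schema from the question
mydf = spark.createDataFrame(
    [("tcp", 443, 1200), ("udp", 53, 300), ("tcp", 80, 800)],
    ["protocol", "source_port", "bytes"],
)

# psf.sum is pyspark.sql.functions.sum, not Python's built-in sum
df_agg = mydf.groupBy("protocol").agg(psf.sum("bytes").alias("total_bytes"))
df_agg.show()
An equivalent option is from pyspark.sql.functions import sum as spark_sum, which likewise avoids shadowing the built-in.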