With this dataframe, I am getting "Column is not iterable" when I try to groupBy and take the max:
linesWithSparkDF
+---+-----+
| id|cycle|
+---+-----+
| 31| 26|
| 31| 28|
| 31| 29|
| 31| 97|
| 31| 98|
| 31| 100|
| 31| 101|
| 31| 111|
| 31| 112|
| 31| 113|
+---+-----+
only showing top 10 rows
<ipython-input-41-373452512490> in runlgmodel2(model, data)
65 linesWithSparkDF.show(10)
66
---> 67 linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(max(col("cycle")))
68 print "linesWithSparkGDF"
69
/usr/hdp/current/spark-client/python/pyspark/sql/column.py in __iter__(self)
241
242 def __iter__(self):
--> 243 raise TypeError("Column is not iterable")
244
245 # string methods
TypeError: Column is not iterable
Note that there is another common cause of "TypeError: Column is not iterable": PySpark's add_months() function takes a column as its first argument and a literal value as its second. If you pass a Column as the second argument, you get this same error (on older Spark versions). The fix there is to wrap the call in the expr() function, as shown below.
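A minimal sketch (the dataframe df and its date and increment columns are hypothetical, purely for illustration):

from pyspark.sql.functions import expr

# Hypothetical df with a "date" column and an integer "increment" column.
# add_months(col("date"), col("increment")) raises "Column is not iterable"
# on older Spark versions; expr() hands the whole expression to Spark SQL,
# which can resolve both column references itself.
df = df.withColumn("inc_date", expr("add_months(date, increment)"))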
It's because you've overwritten the max definition provided by apache-spark, so the call resolves to Python's built-in max instead. It was easy to spot because the built-in max expects an iterable.
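You can reproduce the collision in isolation; this minimal sketch assumes only that col is in scope, as in the OP's code:

from pyspark.sql.functions import col

# Python's built-in max() tries to iterate its argument, and Column.__iter__
# exists only to raise TypeError; that is exactly the error in the traceback above.
max(col("cycle"))   # TypeError: Column is not iterable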
To fix this, you can use a different syntax, and it should work.
linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg({"cycle": "max"})
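(The dictionary form maps the column name to the aggregate function's name as plain strings, so the Python-level max is never called and the collision cannot occur.)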
Or, alternatively:
from pyspark.sql.functions import max as sparkMax
linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))
The idiomatic style for avoiding this problem, which is an unfortunate namespace collision between some Spark SQL function names and Python built-in function names, is to import the Spark SQL functions module like this:
from pyspark.sql import functions as F
# USAGE: F.col(), F.max(), F.someFunc(), ...
Then, using the OP's example, you'd simply apply F like this:
linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")) \
.agg(F.max(F.col("cycle")))
In practice, this is how the problem is avoided idiomatically. =:)
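For completeness, here is a self-contained sketch of the whole fix; the local SparkSession setup and the sample rows are assumptions, reconstructed from the OP's output:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.master("local[*]").appName("demo").getOrCreate()

# Rebuild a small version of the OP's dataframe
linesWithSparkDF = spark.createDataFrame(
    [(31, 26), (31, 28), (31, 29), (31, 97), (31, 98)],
    ["id", "cycle"],
)

# F.max is Spark's aggregate function, safely namespaced away from the built-in max
linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")).agg(F.max(F.col("cycle")))
linesWithSparkGDF.show()
# +---+----------+
# | id|max(cycle)|
# +---+----------+
# | 31|        98|
# +---+----------+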
I know the question is old but this might help someone.
First, import the following:
from pyspark.sql import functions as F
Then:
linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")).agg(F.max(F.col("cycle")))