With this DataFrame, I am getting "Column is not iterable" when I try to groupBy and take the max:
linesWithSparkDF
+---+-----+
| id|cycle|
+---+-----+
| 31|   26|
| 31|   28|
| 31|   29|
| 31|   97|
| 31|   98|
| 31|  100|
| 31|  101|
| 31|  111|
| 31|  112|
| 31|  113|
+---+-----+
only showing top 10 rows
<ipython-input-41-373452512490> in runlgmodel2(model, data)
     65     linesWithSparkDF.show(10)
     66 
---> 67     linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(max(col("cycle")))
     68     print "linesWithSparkGDF"
     69 
/usr/hdp/current/spark-client/python/pyspark/sql/column.py in __iter__(self)
    241 
    242     def __iter__(self):
--> 243         raise TypeError("Column is not iterable")
    244 
    245     # string methods
TypeError: Column is not iterable
(Side note: the same "TypeError: Column is not iterable" also shows up when a Column is passed where PySpark expects a literal. For example, add_months() takes a column as its first argument and a literal as its second; if you pass a Column as the second argument, you get this error, and the fix is to wrap the call in expr(), as sketched below.)
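A minimal sketch of that expr() workaround, assuming a DataFrame df with a date column start_date and an integer column increment (both column names are hypothetical):

from pyspark.sql.functions import expr

# Calling add_months(col("start_date"), col("increment")) directly fails,
# because the second argument must be a literal. expr() evaluates the whole
# call as a SQL expression, where a column reference is allowed.
df = df.withColumn("end_date", expr("add_months(start_date, increment)"))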
It's because you've overwritten the max definition provided by Apache Spark; it was easy to spot, because max was expecting an iterable. Without an import of pyspark.sql.functions.max, the name max resolves to Python's built-in max, which tries to iterate the Column you pass it, and Column.__iter__ raises the TypeError.
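For illustration, a minimal reproduction of the collision, assuming a Spark 2.x+ session named spark:

from pyspark.sql.functions import col

df = spark.createDataFrame([(31, 26), (31, 28)], ["id", "cycle"])

# Without `from pyspark.sql.functions import max`, the name `max` below is
# Python's built-in. It tries to iterate the Column, Column.__iter__ fires,
# and you get "TypeError: Column is not iterable".
df.groupBy(col("id")).agg(max(col("cycle")))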
To fix this, you can use a different syntax, and it should work:
linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg({"cycle": "max"})
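Note that the dictionary form names the result column max(cycle); if you want a cleaner name, rename it afterwards with withColumnRenamed, or use the function form below together with .alias().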
or alternatively
from pyspark.sql.functions import col, max as sparkMax
linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))
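Aliasing the import as sparkMax keeps Python's built-in max available under its own name, so the two no longer collide.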
The idiomatic style for avoiding this problem -- which stems from unfortunate namespace collisions between some Spark SQL function names and Python built-in function names -- is to import the Spark SQL functions module like this:
from pyspark.sql import functions as F 
# USAGE: F.col(), F.max(), F.someFunc(), ...
Then, using the OP's example, you'd simply apply F like this:
linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")) \
                               .agg(F.max(F.col("cycle")))
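For illustration, assuming the ten rows shown in the question are the whole DataFrame, the result would look like this (Spark names the aggregate column max(cycle) by default):

linesWithSparkGDF.show()
# +---+----------+
# | id|max(cycle)|
# +---+----------+
# | 31|       113|
# +---+----------+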
In practice, this is how the problem is avoided idiomatically. =:)
I know the question is old but this might help someone.
First, import the following:
from pyspark.sql import functions as F
Then
linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")).agg(F.max(F.col("cycle")))
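Either spelling works; the essential point in all of these answers is the same: make sure max resolves to pyspark.sql.functions.max (here, F.max) rather than Python's built-in max.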