pyspark Column is not iterable

With this dataframe, I get "Column is not iterable" when I try to groupBy and take the max:

linesWithSparkDF
+---+-----+
| id|cycle|
+---+-----+
| 31|   26|
| 31|   28|
| 31|   29|
| 31|   97|
| 31|   98|
| 31|  100|
| 31|  101|
| 31|  111|
| 31|  112|
| 31|  113|
+---+-----+
only showing top 10 rows


<ipython-input-41-373452512490> in runlgmodel2(model, data)
     65     linesWithSparkDF.show(10)
     66 
---> 67     linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(max(col("cycle")))
     68     print "linesWithSparkGDF"
     69 

/usr/hdp/current/spark-client/python/pyspark/sql/column.py in __iter__(self)
    241 
    242     def __iter__(self):
--> 243         raise TypeError("Column is not iterable")
    244 
    245     # string methods

TypeError: Column is not iterable
asked Apr 28 '16 by oluies


People also ask

How do I make a column iterable in PySpark?

PySpark's add_months() function takes a column as its first argument and a literal value as its second. If you pass a Column type as the second argument, you get "TypeError: Column is not iterable". To fix this, use the expr() function instead.

How do you apply a function to a column in PySpark DataFrame?

To apply a user-defined function to a column of a PySpark DataFrame, import udf from pyspark.sql.functions, wrap your Python function with it, and pass the column name to the wrapped function.

What is withColumn PySpark?

PySpark withColumn() is a transformation function of DataFrame which is used to change the value, convert the datatype of an existing column, create a new column, and many more.

How do you sum a column in PySpark?

Use the sum() aggregate function to get the column's total, then collect() to retrieve the value, where df is the input PySpark DataFrame and column_name is the column to sum.




3 Answers

It's because you've shadowed the max function provided by apache-spark with Python's builtin max. It was easy to spot, because the builtin max expects an iterable.
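The failure mode is easy to reproduce without Spark. Here's a minimal sketch using a stand-in class whose __iter__ raises the same way pyspark/sql/column.py does in the traceback above:

```python
# Stand-in for pyspark.sql.Column: like the real class, its __iter__
# raises TypeError, so Python's builtin max() (which iterates its
# single argument) fails with the exact message from the question.
class FakeColumn:
    def __iter__(self):
        raise TypeError("Column is not iterable")

try:
    max(FakeColumn())        # builtin max, not pyspark.sql.functions.max
except TypeError as e:
    print(e)                 # Column is not iterable
```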

To fix this, you can use a different syntax, and it should work.

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg({"cycle": "max"})

or alternatively

from pyspark.sql.functions import max as sparkMax

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))
answered Oct 22 '22 by Alberto Bonsanto


The idiomatic way to avoid this problem -- an unfortunate namespace collision between some Spark SQL function names and Python built-in function names -- is to import the Spark SQL functions module like this:

from pyspark.sql import functions as F 
# USAGE: F.col(), F.max(), F.someFunc(), ...

Then, using the OP's example, you'd simply apply F like this:

linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")) \
                                    .agg(F.max(F.col("cycle")))

In practice, this is how the problem is idiomatically avoided. =:)

answered Oct 22 '22 by NYCeyes


I know the question is old, but this might help someone.

First, import the following:

from pyspark.sql import functions as F

Then

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(F.max(col("cycle")))

answered Oct 22 '22 by SamaAdi