pyspark Column is not iterable

With this dataframe, I get "Column is not iterable" when I try to groupBy and take the max:

linesWithSparkDF
+---+-----+
| id|cycle|
+---+-----+
| 31|   26|
| 31|   28|
| 31|   29|
| 31|   97|
| 31|   98|
| 31|  100|
| 31|  101|
| 31|  111|
| 31|  112|
| 31|  113|
+---+-----+
only showing top 10 rows


<ipython-input-41-373452512490> in runlgmodel2(model, data)
     65     linesWithSparkDF.show(10)
     66 
---> 67     linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(max(col("cycle")))
     68     print "linesWithSparkGDF"
     69 

/usr/hdp/current/spark-client/python/pyspark/sql/column.py in __iter__(self)
    241 
    242     def __iter__(self):
--> 243         raise TypeError("Column is not iterable")
    244 
    245     # string methods

TypeError: Column is not iterable
asked Apr 28 '16 by oluies


People also ask

How do I make a column iterable in PySpark?

PySpark's add_months() function takes a column as its first argument and a literal value as its second. If you pass a Column type as the second argument, you get "TypeError: Column is not iterable". To fix this, use the expr() function instead.

How do you apply a function to a column in PySpark DataFrame?

To apply a user-defined function to a column of a PySpark DataFrame, import udf from pyspark.sql.functions, wrap your Python function with it, and pass the column name to the wrapped function.

What is withColumn PySpark?

PySpark withColumn() is a transformation function of DataFrame which is used to change the value, convert the datatype of an existing column, create a new column, and many more.

How do you sum a column in PySpark?

Use the sum() aggregate function to get the column's total, then collect() to retrieve the value, where df is the input PySpark DataFrame and column_name is the column to sum.




3 Answers

It's because you've shadowed the max function provided by apache-spark with Python's builtin max. It was easy to spot, because the builtin max expects an iterable.
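The failure mode is easy to reproduce without Spark. Here's a minimal sketch using a stand-in class whose __iter__ raises the same way pyspark/sql/column.py does in the traceback above:

```python
# Stand-in for pyspark.sql.Column: like the real class, its __iter__
# raises TypeError, so Python's builtin max() (which iterates its
# single argument) fails with the exact message from the question.
class FakeColumn:
    def __iter__(self):
        raise TypeError("Column is not iterable")

try:
    max(FakeColumn())        # builtin max, not pyspark.sql.functions.max
except TypeError as e:
    print(e)                 # Column is not iterable
```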

To fix this, you can use a different syntax, and it should work.

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg({"cycle": "max"})

or alternatively

from pyspark.sql.functions import max as sparkMax

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(sparkMax(col("cycle")))
answered Oct 22 '22 by Alberto Bonsanto


The idiomatic way to avoid this problem -- an unfortunate namespace collision between some Spark SQL function names and Python built-in function names -- is to import the Spark SQL functions module like this:

from pyspark.sql import functions as F 
# USAGE: F.col(), F.max(), F.someFunc(), ...

Then, using the OP's example, you'd simply apply F like this:

linesWithSparkGDF = linesWithSparkDF.groupBy(F.col("id")) \
                                    .agg(F.max(F.col("cycle")))

In practice, this is how the problem is idiomatically avoided. =:)

answered Oct 22 '22 by NYCeyes


I know the question is old, but this might help someone.

First, import the following:

from pyspark.sql import functions as F

Then

linesWithSparkGDF = linesWithSparkDF.groupBy(col("id")).agg(F.max(col("cycle")))

answered Oct 22 '22 by SamaAdi