
How to loop through each row of a DataFrame in PySpark

E.g.:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
sample = sqlContext.sql("select Name, age, city from user")
sample.show()

The above statement prints the entire table on the terminal, but I want to access each row in that table using a for or while loop to perform further calculations.

asked Apr 01 '16 by Arti Berde

People also ask

How do you iterate over rows and columns in a PySpark DataFrame?

iterrows() is a pandas method rather than a native PySpark one, so it is used by first converting the PySpark DataFrame with toPandas(). It can then be used with a for loop, yielding an (index, row) pair for each row of the DataFrame.
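A minimal sketch, assuming an active SparkSession named spark, pandas installed, and a DataFrame small enough to fit in driver memory:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [('Alice', 30, 'Pune'), ('Bob', 25, 'Delhi')],
    ['Name', 'age', 'city']
)

# toPandas() pulls the whole DataFrame to the driver as a pandas DataFrame;
# iterrows() then yields (index, row) pairs
for index, row in df.toPandas().iterrows():
    print(index, row['Name'], row['age'])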

What does .collect do in PySpark?

collect() is an action on an RDD or DataFrame that retrieves the data from all partitions and brings it to the driver node/program. It is useful for retrieving every row of a result, but it loads the entire dataset into driver memory.
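A short illustration, reusing the small df built above:

# collect() materializes every row on the driver as a list of Row objects,
# so use it only when the result is small
for row in df.collect():
    print(row.Name, row.city)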

What is Rlike in PySpark?

rlike() is similar to like() but with regex (regular expression) support. It can be used in Spark SQL query expressions as well, and is similar to the regexp_like() function in SQL.
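For example, to keep only the rows whose city matches a regex (the column name and pattern here are assumptions for illustration):

# ^New anchors the regular expression at the start of the string
matched = df.filter(df.city.rlike('^New'))
matched.show()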


2 Answers

You simply cannot. DataFrames, like other distributed data structures, are not iterable and can only be accessed through dedicated higher-order functions and/or SQL methods.

You can of course collect:

for row in df.rdd.collect():
    do_something(row)

or convert to a local iterator with toLocalIterator:

for row in df.rdd.toLocalIterator():
    do_something(row)

and iterate locally as shown above, but it defeats the whole purpose of using Spark.
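If the per-row work is purely a side effect (logging, writing to an external store) and nothing needs to come back to the driver, one such higher-order function is foreach, which runs on the executors. A minimal sketch, not part of the original answer, assuming do_something is an ordinary Python function:

def do_something(row):
    # Runs on the executors, not the driver; foreach returns nothing,
    # so this is only useful for side effects
    print(row)

df.foreach(do_something)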

answered Sep 21 '22 by 2 revs, 2 users 70%


To "loop" and take advantage of Spark's parallel computation framework, you could define a custom function and use map.

def customFunction(row):
    return (row.name, row.age, row.city)

sample2 = sample.rdd.map(customFunction)

or

sample2 = sample.rdd.map(lambda x: (x.name, x.age, x.city)) 

The custom function would then be applied to every row of the DataFrame. Note that sample2 will be an RDD, not a DataFrame.
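If a DataFrame is needed again after the map, an RDD of tuples can be converted back with toDF, supplying column names. A short sketch, assuming the tuple order from the example above:

sample2_df = sample2.toDF(['name', 'age', 'city'])
sample2_df.show()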

map may be needed if you are going to perform more complex computations. If you just need to add a simple derived column, you can use withColumn, which returns a DataFrame.

sample3 = sample.withColumn('age2', sample.age + 2) 
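For conditional derived columns, the built-in functions in pyspark.sql.functions keep everything as DataFrame operations. A sketch, where the age_group column and the threshold of 18 are arbitrary illustrations:

from pyspark.sql import functions as F

# when/otherwise builds a conditional column without leaving the DataFrame API
sample4 = sample.withColumn(
    'age_group',
    F.when(F.col('age') >= 18, 'adult').otherwise('minor')
)
sample4.show()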
answered Sep 20 '22 by David